spark-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Estimate size of Dataframe programatically
Date Mon, 10 Aug 2015 15:23:45 GMT
From a quick glance at SparkStrategies.scala, when statistics.sizeInBytes
of the LogicalPlan is <= autoBroadcastJoinThreshold, the plan's output
is used as the 'build' relation in a broadcast join.
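
To make that comparison concrete, here is a minimal sketch. The helper
wouldBroadcast is hypothetical (it is not a Spark API), and the commented
shell lines assume the Spark 1.x interfaces of the time; later releases
expose the plan's size estimate differently.

```scala
// Hypothetical helper, not part of Spark: mirrors the comparison described
// above (plan size estimate <= autoBroadcastJoinThreshold).
def wouldBroadcast(sizeInBytes: BigInt, thresholdBytes: Long): Boolean =
  sizeInBytes <= BigInt(thresholdBytes)

// In a Spark 1.x shell the two inputs could be read roughly like this
// (assumed API for that era, left as comments so the block runs without Spark):
//   val size = df.queryExecution.optimizedPlan.statistics.sizeInBytes
//   val threshold = sqlContext
//     .getConf("spark.sql.autoBroadcastJoinThreshold", "10485760").toLong
//   wouldBroadcast(size, threshold)
```

Note that the value compared is the planner's own estimate for the logical
plan, which need not match what SizeEstimator reports for the df object.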

FYI

On Mon, Aug 10, 2015 at 8:04 AM, Srikanth <srikanth.ht@gmail.com> wrote:

> SizeEstimator.estimate(df) will not give the size of the dataframe, right? I
> think it will give the size of the df object.
>
> With an RDD, I sample(), collect(), and sum the size of each row. If I do the
> same with a dataframe, the result will not reflect its size when represented
> in columnar format.
>
> I'd also like to know how spark.sql.autoBroadcastJoinThreshold estimates
> the size of a dataframe. Is it going to broadcast when the columnar storage
> size is less than 10 MB?
>
> Srikanth
>
>
> On Fri, Aug 7, 2015 at 2:51 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>
>> Have you tried calling SizeEstimator.estimate() on a DataFrame ?
>>
>> I did the following in REPL:
>>
>> scala> SizeEstimator.estimate(df)
>> res1: Long = 17769680
>>
>> FYI
>>
>> On Fri, Aug 7, 2015 at 6:48 AM, Srikanth <srikanth.ht@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Is there a way to estimate the approximate size of a dataframe? I know
>>> we can cache it and look at the size in the UI, but I'm trying to do this
>>> programmatically. With an RDD, I can sample and sum up sizes using
>>> SizeEstimator, then extrapolate to the entire RDD. That gives me the
>>> approximate size of the RDD. With dataframes, it's tricky due to columnar
>>> storage. How do we do it?
>>>
>>> On a related note, I see size of RDD object to be ~60MB. Is that the
>>> footprint of RDD in driver JVM?
>>>
>>> scala> val temp = sc.parallelize(Array(1,2,3,4,5,6))
>>> scala> SizeEstimator.estimate(temp)
>>> res13: Long = 69507320
>>>
>>> Srikanth
>>>
>>
>>
>
