spark-user mailing list archives

From Srikanth <srikanth...@gmail.com>
Subject Re: Estimate size of Dataframe programmatically
Date Mon, 10 Aug 2015 15:04:43 GMT
SizeEstimator.estimate(df) will not give the size of the DataFrame, right? I
think it gives the size of the df object itself.

With an RDD, I sample(), collect(), and sum the size of each row. If I do the
same with a DataFrame, the result will no longer match its size when
represented in columnar format.
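The sample-and-extrapolate approach for RDDs can be sketched as below. The
helper function `extrapolateSize` is hypothetical (not a Spark API), and the
Spark calls are shown in comments since they assume a live SparkContext; the
caveat stands that this measures row-object sizes, not the columnar cached
footprint of a DataFrame.

```scala
// Hypothetical helper: scale the measured size of a sample up to the full
// dataset. `sampleBytes` is SizeEstimator.estimate() over the collected
// sample; `sampleCount` and `totalCount` are row counts.
def extrapolateSize(sampleBytes: Long, sampleCount: Long, totalCount: Long): Long =
  if (sampleCount == 0) 0L
  else (sampleBytes.toDouble / sampleCount * totalCount).toLong

// Sketch of the usage described above (requires Spark on the classpath):
// import org.apache.spark.util.SizeEstimator
// val sample = rdd.sample(withReplacement = false, fraction = 0.01).collect()
// val approxBytes =
//   extrapolateSize(SizeEstimator.estimate(sample), sample.length, rdd.count())
```

Note the extrapolation assumes rows are roughly uniform in size; heavily
skewed rows (e.g. variable-length strings) will make the estimate noisy.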

I'd also like to know how spark.sql.autoBroadcastJoinThreshold estimates the
size of a DataFrame. Will it broadcast when the columnar storage size is less
than 10 MB?
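For what it's worth, my understanding (an assumption about Spark 1.x
internals, not something confirmed in this thread) is that the broadcast
decision uses the Catalyst planner's statistics estimate for the logical
plan, not the Java object size:

```scala
// Assumption: in Spark 1.x the planner compares the logical plan's estimated
// size against the threshold. For a DataFrame `df` (needs a live session):
// val planSize = df.queryExecution.logical.statistics.sizeInBytes

// The default threshold is 10 MB, expressed in bytes:
val defaultThreshold: Long = 10L * 1024 * 1024 // 10485760

// It can be raised, or disabled entirely with -1:
// sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
//   (50L * 1024 * 1024).toString)
```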

Srikanth


On Fri, Aug 7, 2015 at 2:51 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> Have you tried calling SizeEstimator.estimate() on a DataFrame ?
>
> I did the following in REPL:
>
> scala> SizeEstimator.estimate(df)
> res1: Long = 17769680
>
> FYI
>
> On Fri, Aug 7, 2015 at 6:48 AM, Srikanth <srikanth.ht@gmail.com> wrote:
>
>> Hello,
>>
>> Is there a way to estimate the approximate size of a DataFrame? I know we
>> can cache and look at the size in the UI, but I'm trying to do this
>> programmatically. With an RDD, I can sample, sum up the sizes using
>> SizeEstimator, and then extrapolate to the entire RDD. That gives me the
>> approximate size of the RDD. With DataFrames, it's tricky due to columnar
>> storage. How do we do it?
>>
>> On a related note, I see the size of the RDD object to be ~60MB. Is that
>> the footprint of the RDD in the driver JVM?
>>
>> scala> val temp = sc.parallelize(Array(1,2,3,4,5,6))
>> scala> SizeEstimator.estimate(temp)
>> res13: Long = 69507320
>>
>> Srikanth
>>
>
>
