spark-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Estimate size of Dataframe programatically
Date Fri, 07 Aug 2015 18:51:33 GMT
Have you tried calling SizeEstimator.estimate() on a DataFrame?

I did the following in the REPL:

scala> import org.apache.spark.util.SizeEstimator
import org.apache.spark.util.SizeEstimator

scala> SizeEstimator.estimate(df)
res1: Long = 17769680

FYI
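
The sample-and-extrapolate approach described below for RDDs can also be sketched for a DataFrame. This is an illustrative sketch only, not a tested recipe: `df`, `fraction`, and the extrapolation arithmetic are assumptions, and `collect()` on the sample must fit in driver memory.

```scala
// Hedged sketch: sample a fraction of rows, measure the sampled rows on the
// driver with SizeEstimator, then scale up by the full row count.
// NOTE: this measures driver-side Row objects, not Spark's internal
// (e.g. columnar/cached) representation, so treat it as a rough bound.
import org.apache.spark.util.SizeEstimator

val fraction = 0.01
val sample = df.sample(withReplacement = false, fraction).collect()
val sampleBytes = SizeEstimator.estimate(sample)
val approxTotalBytes =
  if (sample.length > 0) sampleBytes * (df.count().toDouble / sample.length)
  else 0.0
```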

On Fri, Aug 7, 2015 at 6:48 AM, Srikanth <srikanth.ht@gmail.com> wrote:

> Hello,
>
> Is there a way to estimate the approximate size of a dataframe? I know we
> can cache it and look at the size in the UI, but I'm trying to do this
> programmatically. With an RDD, I can sample rows, sum up their size using
> SizeEstimator, and then extrapolate to the entire RDD. That gives me the
> approximate size of the RDD. With dataframes, it's tricky due to columnar
> storage. How do we do it?
>
> On a related note, I see the size of the RDD object to be ~60MB. Is that
> the footprint of the RDD in the driver JVM?
>
> scala> val temp = sc.parallelize(Array(1,2,3,4,5,6))
> scala> SizeEstimator.estimate(temp)
> res13: Long = 69507320
>
> Srikanth
>
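
[Editorial note on the ~66MB figure above: SizeEstimator.estimate walks the whole object graph reachable from the object you pass it, so measuring the RDD wrapper itself likely pulls in driver-side references rather than just the six integers. One way to see the difference, sketched here as an untested illustration, is to measure the materialized data instead of the RDD object:]

```scala
// Hedged sketch: compare the estimate for the RDD wrapper object
// (which drags in referenced driver-side state) with the estimate
// for just the collected elements.
import org.apache.spark.util.SizeEstimator

val temp = sc.parallelize(Array(1, 2, 3, 4, 5, 6))
val wrapperBytes = SizeEstimator.estimate(temp)           // RDD object graph
val dataBytes    = SizeEstimator.estimate(temp.collect()) // just the array
```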
