From a quick glance at SparkStrategies.scala: when statistics.sizeInBytes of a LogicalPlan is <= spark.sql.autoBroadcastJoinThreshold, that plan's output is used as the 'build' relation in a broadcast join.
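
You can also check the size the planner sees for a particular DataFrame. A minimal sketch, assuming a DataFrame named df and the REPL's sqlContext (note that queryExecution and statistics are internal Catalyst APIs, so the exact accessors may vary across versions):

// Estimated size, in bytes, that the planner uses for the broadcast decision.
val planSize: BigInt = df.queryExecution.analyzed.statistics.sizeInBytes

// The threshold it is compared against; the default is 10 MB (10485760 bytes).
val threshold = sqlContext.getConf("spark.sql.autoBroadcastJoinThreshold", "10485760").toLong

If planSize <= threshold, that side of the join is eligible to be the broadcast 'build' relation.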

FYI

On Mon, Aug 10, 2015 at 8:04 AM, Srikanth <srikanth.ht@gmail.com> wrote:
SizeEstimator.estimate(df) will not give the size of the DataFrame, right? I think it will give the size of the df object itself.

With an RDD, I sample(), collect(), and sum the size of each row. If I do the same with a DataFrame, the result will no longer match its size when represented in columnar format.

I'd also like to know how spark.sql.autoBroadcastJoinThreshold estimates the size of a DataFrame. Is it going to broadcast when the columnar storage size is less than 10 MB?

Srikanth


On Fri, Aug 7, 2015 at 2:51 PM, Ted Yu <yuzhihong@gmail.com> wrote:
Have you tried calling SizeEstimator.estimate() on a DataFrame?

I did the following in the REPL:

scala> import org.apache.spark.util.SizeEstimator
import org.apache.spark.util.SizeEstimator

scala> SizeEstimator.estimate(df)
res1: Long = 17769680

FYI

On Fri, Aug 7, 2015 at 6:48 AM, Srikanth <srikanth.ht@gmail.com> wrote:
Hello,

Is there a way to estimate the approximate size of a DataFrame? I know we can cache it and look at the size in the UI, but I'm trying to do this programmatically. With an RDD, I can sample it, sum up the sizes using SizeEstimator, and then extrapolate to the entire RDD. That gives me the approximate size of the RDD (sketched below). With DataFrames, it's tricky due to columnar storage. How do we do it?
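
Concretely, here is roughly what I do for the RDD case (a sketch only: the 1% fraction and the names are arbitrary, and per-row SizeEstimator calls can overcount objects shared between rows):

import org.apache.spark.util.SizeEstimator

val fraction = 0.01
// Bring a small sample of rows to the driver and measure each one there.
val sampled = rdd.sample(withReplacement = false, fraction).collect()
// Extrapolate: scale the sampled total back up by the sampling fraction.
val approxBytes = sampled.map(row => SizeEstimator.estimate(row.asInstanceOf[AnyRef])).sum / fraction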

On a related note, I see the size of the RDD object below estimated at ~66 MB. Is that the footprint of the RDD in the driver JVM?

scala> import org.apache.spark.util.SizeEstimator
import org.apache.spark.util.SizeEstimator

scala> val temp = sc.parallelize(Array(1,2,3,4,5,6))
scala> SizeEstimator.estimate(temp)
res13: Long = 69507320

Srikanth