From a quick glance at SparkStrategies.scala: when statistics.sizeInBytes of the LogicalPlan is <= autoBroadcastJoinThreshold, the plan's output is used as the 'build' relation in a broadcast join.
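To make the decision concrete, here is a minimal plain-Scala sketch of that comparison. The names sizeInBytes and autoBroadcastJoinThreshold mirror the real ones, but canBroadcast itself is a hypothetical stand-in for the check in SparkStrategies.scala, not an actual Spark method:

```scala
// Hypothetical sketch of the broadcast-side selection in SparkStrategies:
// a relation qualifies as the broadcast ("build") side when the threshold
// is enabled (> 0) and the plan's estimated size does not exceed it.
def canBroadcast(sizeInBytes: BigInt, autoBroadcastJoinThreshold: Long): Boolean =
  autoBroadcastJoinThreshold > 0 && sizeInBytes <= BigInt(autoBroadcastJoinThreshold)

// The default for spark.sql.autoBroadcastJoinThreshold is 10 MB (10485760 bytes).
val threshold = 10L * 1024 * 1024

println(canBroadcast(BigInt(5L * 1024 * 1024), threshold))  // small plan -> true
println(canBroadcast(BigInt(50L * 1024 * 1024), threshold)) // too large -> false
```

Note the size being compared is the optimizer's estimate for the logical plan, not the in-memory footprint of the DataFrame object.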


On Mon, Aug 10, 2015 at 8:04 AM, Srikanth <> wrote:
SizeEstimator.estimate(df) will not give the size of the dataframe, right? I think it will give the size of the df object itself.

With an RDD, I sample(), collect(), and sum the size of each row. If I do the same with a dataframe, the result will no longer reflect its size when represented in columnar format.

I'd also like to know how spark.sql.autoBroadcastJoinThreshold estimates the size of a dataframe. Is it going to broadcast when the columnar storage size is less than 10 MB?


On Fri, Aug 7, 2015 at 2:51 PM, Ted Yu <> wrote:
Have you tried calling SizeEstimator.estimate() on a DataFrame?

I did the following in REPL:

scala> SizeEstimator.estimate(df)
res1: Long = 17769680


On Fri, Aug 7, 2015 at 6:48 AM, Srikanth <> wrote:

Is there a way to estimate the approximate size of a dataframe? I know we can cache it and look at the size in the UI, but I'm trying to do this programmatically. With an RDD, I can sample rows, sum up their sizes using SizeEstimator, and then extrapolate to the entire RDD. That gives me the approximate size of the RDD. With dataframes, it's tricky due to columnar storage. How do we do it?
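For reference, the sample-and-extrapolate step described above boils down to simple arithmetic. This is a hypothetical helper, not a Spark API; in practice the per-row sizes would come from SizeEstimator.estimate on each row of a collected sample:

```scala
// Hypothetical sketch: extrapolate a full-RDD size from a row sample.
// sampleSizes would be produced by SizeEstimator.estimate per sampled row;
// here we only illustrate the arithmetic.
def extrapolateSize(sampleSizes: Seq[Long], totalCount: Long): Long = {
  require(sampleSizes.nonEmpty, "need a non-empty sample")
  val avgRowBytes = sampleSizes.sum.toDouble / sampleSizes.length
  (avgRowBytes * totalCount).toLong
}

// e.g. 100 sampled rows averaging 64 bytes each, 1,000,000 rows total:
val est = extrapolateSize(Seq.fill(100)(64L), 1000000L)
println(est) // 64000000
```

As noted above, this works for row-based RDDs but over-estimates a DataFrame's cached size, since columnar storage compresses per-column data.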

On a related note, I see the size of the RDD object reported as ~60 MB. Is that the footprint of the RDD in the driver JVM?

scala> val temp = sc.parallelize(Array(1,2,3,4,5,6))
scala> SizeEstimator.estimate(temp)
res13: Long = 69507320