spark-dev mailing list archives

From gen tang <>
Subject Potential bug in BroadcastNestedLoopJoin or default value of spark.sql.autoBroadcastJoinThreshold
Date Tue, 11 Aug 2015 14:11:35 GMT

Recently, I used Spark SQL to do a join on a non-equality condition, for
example "condition1 or condition2".

Spark will use BroadcastNestedLoopJoin to do this. Assume that one of the
dataframes (df1) is created neither from a Hive table nor from a local
collection, while the other one (df2) is created from a Hive table. For df1,
Spark estimates its size as defaultSizeInBytes * (length of df1), while it
uses the correct size for df2.

As a result, in most cases Spark will think df1 is bigger than df2, even
when df2 is really huge. Spark will then do df2.collect() in order to
broadcast it, which causes errors or slows the program down badly.
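To illustrate the effect, here is a minimal sketch in Python of the size
comparison described above. It is not Spark's actual planner code: the 10 MB
threshold matches the documented default of
spark.sql.autoBroadcastJoinThreshold, but the per-row fallback size, the row
counts, and the estimated_size helper are illustrative assumptions.

```python
# Sketch of the size-estimate comparison described above. Constants and
# the estimated_size() helper are illustrative, not Spark internals.

AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024        # Spark default: 10 MB
DEFAULT_SIZE_IN_BYTES = AUTO_BROADCAST_JOIN_THRESHOLD + 1  # assumed fallback

def estimated_size(num_rows, real_bytes_per_row=None):
    """Return a planner-style size estimate for a dataframe.

    A dataframe backed by a Hive table has real statistics; one backed by
    a plain RDD (LogicalRDD) falls back to defaultSizeInBytes per row,
    as described in the report above.
    """
    if real_bytes_per_row is not None:           # df2: correct stats available
        return num_rows * real_bytes_per_row
    return num_rows * DEFAULT_SIZE_IN_BYTES     # df1: pessimistic default

# df1: small in reality (1,000 rows), but created from an RDD, so no stats.
df1_estimate = estimated_size(1_000)

# df2: genuinely huge (50 million rows of ~100 bytes), but has Hive stats.
df2_estimate = estimated_size(50_000_000, real_bytes_per_row=100)

# The planner broadcasts (collects) the side it believes is smaller:
broadcast_side = "df1" if df1_estimate < df2_estimate else "df2"
print(broadcast_side)  # df2 is chosen, even though it is really huge
```

With these numbers df1 is estimated at roughly 10 GB (1,000 rows times the
~10 MB fallback), so the genuinely huge df2 looks like the cheaper side to
broadcast.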

Maybe we should just use defaultSizeInBytes for LogicalRDD, rather than
defaultSizeInBytes * length?

Hope this is helpful.
Thanks a lot in advance for your help and input.

