hash. Sort-based shuffle is more memory-efficient and is the default option starting in 1.2.
You can try setting the "spark.sql.join.preferSortMergeJoin" config option to false. For detailed join strategies, take a look at the source code of SparkStrategies.scala:

/**
* Select the proper physical plan for join based on joining keys and size of logical plan.
* At first, uses the [[ExtractEquiJoinKeys]] pattern to find joins where at least some of the
* predicates can be evaluated by matching join keys. If found, Join implementations are chosen
* with the following precedence:
* - Broadcast: if one side of the join has an estimated physical size that is smaller than the
* user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold
* or if that side has an explicit broadcast hint (e.g. the user applied the
* [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side
* of the join will be broadcasted and the other side will be streamed, with no shuffling
* performed. If both sides of the join are eligible to be broadcasted then the
* smaller side (based on stats) will be broadcasted.
* - Shuffle hash join: if the average size of a single partition is small enough to build a hash
* table.
* - Sort merge: if the matching join keys are sortable.
* If there is no joining keys, Join implementations are chosen with the following precedence:
* - BroadcastNestedLoopJoin: if one side of the join could be broadcasted
* - CartesianProduct: for Inner join
* - BroadcastNestedLoopJoin
*/

On Jul 5, 2016, at 13:28, Lalitha MV <email@example.com> wrote:

It picks sort merge join when spark.sql.autoBroadcastJoinThreshold is set to -1, or when the size of the small table is more than spark.sql.autoBroadcastJoinThreshold.

On Mon, Jul 4, 2016 at 10:17 PM, Takeshi Yamamuro <firstname.lastname@example.org> wrote:

The join selection is described in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L92. If you have join keys, you can set spark.sql.autoBroadcastJoinThreshold to -1 to disable broadcast joins. Then, hash joins are used in queries.

// maropu

On Tue, Jul 5, 2016 at 4:23 AM, Lalitha MV <email@example.com> wrote:

Hi maropu,

Thanks for your reply. Would it be possible to write a rule for this, to make it always pick shuffle hash join over the other join implementations (i.e. sort merge and broadcast)? Is there any documentation demonstrating rule-based transformation of physical plan trees?

Thanks,
Lalitha

On Sat, Jul 2, 2016 at 12:58 AM, Takeshi Yamamuro <firstname.lastname@example.org> wrote:

Hi,

No, Spark has no hint for the hash join.

// maropu

On Fri, Jul 1, 2016 at 4:56 PM, Lalitha MV <email@example.com> wrote:

Hi,

In order to force broadcast hash join, we can set the spark.sql.autoBroadcastJoinThreshold config. Is there a way to enforce shuffle hash join in Spark SQL?

Thanks,
Lalitha
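Putting the knobs from this thread together, here is a minimal configuration sketch (assuming Spark 2.x, where spark.sql.join.preferSortMergeJoin is available) of steering the planner away from broadcast and sort-merge joins so that shuffle hash join can be selected for an equi-join:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("join-strategy-sketch")
  .master("local[*]")
  // -1 disables broadcast joins: no side is ever considered small enough.
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  // Lets the planner pick shuffle hash join over sort merge join when the
  // average partition of the build side is small enough for a hash table.
  .config("spark.sql.join.preferSortMergeJoin", "false")
  .getOrCreate()

import spark.implicits._

val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
val right = Seq((1, "x"), (2, "y")).toDF("id", "r")

// With the settings above this equi-join can no longer be planned as a
// broadcast join; inspect the chosen strategy in the physical plan.
left.join(right, "id").explain()

// Conversely, an explicit hint forces a broadcast join regardless of size stats:
left.join(broadcast(right), "id").explain()
```

Note that the explain() output, not the configs alone, is the authoritative check: which non-broadcast strategy actually wins still depends on the size estimates and sortability rules quoted from SparkStrategies.scala above.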