spark-user mailing list archives

From "喜之郎" <>
Subject Re: Enforcing shuffle hash join
Date Tue, 05 Jul 2016 08:59:46 GMT
You can try setting "spark.shuffle.manager" to "hash".
The documentation describes this parameter as follows:
Implementation to use for shuffling data. There are two implementations available: sort and
hash. Sort-based shuffle is more memory-efficient and is the default option starting in 1.2.

------------------ Original Message ------------------
From: "Lalitha MV";<>;
Sent: Tuesday, July 5, 2016, 2:44 PM
To: "Sun Rui"<>; 
Cc: "Takeshi Yamamuro"<>; ""<>;

Subject: Re: Enforcing shuffle hash join

Even with preferSortMergeJoin set to false, it still only picks between merge join and broadcast
join. It does not pick shuffle hash join depending on the value of autoBroadcastJoinThreshold.
I went through SparkStrategies, and it doesn't look like there is a direct, clean way to enforce

On Mon, Jul 4, 2016 at 10:56 PM, Sun Rui <> wrote:
You can try setting the “spark.sql.join.preferSortMergeJoin” conf option to false.

For detailed join strategies, take a look at the source code of SparkStrategies.scala:
 * Select the proper physical plan for join based on joining keys and size of logical plan.
 * At first, uses the [[ExtractEquiJoinKeys]] pattern to find joins where at least some of the
 * predicates can be evaluated by matching join keys. If found, Join implementations are chosen
 * with the following precedence:
 * - Broadcast: if one side of the join has an estimated physical size that is smaller than the
 *     user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold
 *     or if that side has an explicit broadcast hint (e.g. the user applied the
 *     [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side
 *     of the join will be broadcasted and the other side will be streamed, with no shuffling
 *     performed. If both sides of the join are eligible to be broadcasted then the
 * - Shuffle hash join: if the average size of a single partition is small enough to build
 *     a hash table.
 * - Sort merge: if the matching join keys are sortable.
 * If there is no joining keys, Join implementations are chosen with the following precedence:
 * - BroadcastNestedLoopJoin: if one side of the join could be broadcasted
 * - CartesianProduct: for Inner join
 * - BroadcastNestedLoopJoin
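
The equi-join precedence in the quoted comment can be restated as a small, self-contained Python sketch. This is only a toy model of the comment's wording, not Spark's planner code: the names `JoinSide`, `choose_equi_join`, and the `max_hash_partition_bytes` limit are hypothetical, and the comment does not say how "small enough to build a hash table" is actually measured. As reported elsewhere in this thread, the real planner may still pick sort merge join where this model would pick shuffle hash join.

```python
from dataclasses import dataclass


@dataclass
class JoinSide:
    size_in_bytes: int            # estimated physical size of this side
    broadcast_hint: bool = False  # True if the user applied broadcast()


def choose_equi_join(left, right, broadcast_threshold, num_partitions,
                     max_hash_partition_bytes, keys_sortable=True):
    """Toy model of the documented precedence (hypothetical helper)."""

    def can_broadcast(side):
        # With a threshold of -1 no non-negative size is smaller than it,
        # so only an explicit hint can make a side broadcastable.
        return side.broadcast_hint or side.size_in_bytes < broadcast_threshold

    # 1. Broadcast: one side is small enough or carries an explicit hint.
    if can_broadcast(left) or can_broadcast(right):
        return "broadcast"

    # 2. Shuffle hash join: average partition size of the smaller side
    #    fits under the assumed hash-table limit.
    smaller = min(left.size_in_bytes, right.size_in_bytes)
    if smaller / num_partitions <= max_hash_partition_bytes:
        return "shuffle hash"

    # 3. Sort merge: the join keys are sortable.
    if keys_sortable:
        return "sort merge"

    # The quoted comment does not describe this case.
    return None


big = JoinSide(size_in_bytes=10**9)
small = JoinSide(size_in_bytes=1000)
print(choose_equi_join(small, big, broadcast_threshold=-1,
                       num_partitions=200, max_hash_partition_bytes=1024))
```

In this model, disabling size-based broadcast with a threshold of -1 lets the second rule apply whenever the smaller side's average partition fits under the assumed limit.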

On Jul 5, 2016, at 13:28, Lalitha MV <> wrote:

It picks sort merge join when spark.sql.autoBroadcastJoinThreshold is set to -1, or when
the size of the small table is more than spark.sql.autoBroadcastJoinThreshold.

On Mon, Jul 4, 2016 at 10:17 PM, Takeshi Yamamuro <> wrote:
The join selection can be described in
If you have join keys, you can set `spark.sql.autoBroadcastJoinThreshold` to -1 to disable
broadcast joins. Then, hash joins are used in queries.

// maropu 
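
One way to apply the two SQL options named in this thread is to pass them as `--conf key=value` arguments when launching the application. The sketch below only builds that argument list in plain Python; the helper name `to_submit_args` is hypothetical, and whether these settings actually produce a shuffle hash join is exactly what this thread leaves open.

```python
def to_submit_args(settings):
    """Render a settings dict as a list of --conf key=value arguments."""
    args = []
    for key, value in settings.items():
        # Booleans are written in lowercase ("true"/"false").
        text = str(value).lower() if isinstance(value, bool) else str(value)
        args += ["--conf", f"{key}={text}"]
    return args


# The two options discussed in this thread.
join_settings = {
    "spark.sql.autoBroadcastJoinThreshold": -1,    # disable size-based broadcast
    "spark.sql.join.preferSortMergeJoin": False,   # do not prefer sort merge
}

print(" ".join(to_submit_args(join_settings)))
```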

On Tue, Jul 5, 2016 at 4:23 AM, Lalitha MV <> wrote:
Hi maropu, 

Thanks for your reply. 

Would it be possible to write a rule for this, to make it always pick shuffle hash join over
other join implementations (i.e. sort merge and broadcast)? 

Is there any documentation demonstrating rule based transformation for physical plan trees?


On Sat, Jul 2, 2016 at 12:58 AM, Takeshi Yamamuro <> wrote:

No, Spark has no hint for the hash join.

// maropu

On Fri, Jul 1, 2016 at 4:56 PM, Lalitha MV <> wrote:

In order to force broadcast hash join, we can set the spark.sql.autoBroadcastJoinThreshold
config. Is there a way to enforce shuffle hash join in spark sql? 




Takeshi Yamamuro



