spark-user mailing list archives

From Sun Rui <sunrise_...@163.com>
Subject Re: Enforcing shuffle hash join
Date Tue, 05 Jul 2016 05:56:25 GMT
You can try setting the “spark.sql.join.preferSortMergeJoin” option to false.
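For example, a minimal sketch against the Spark 2.x API (the session and table names here are placeholders, and this only steers the planner — shuffle hash join is still chosen only when its own conditions hold, per the doc comment below):

```scala
import org.apache.spark.sql.SparkSession

// `spark` is assumed to be an existing or freshly built SparkSession.
val spark = SparkSession.builder().appName("shj-demo").getOrCreate()

// Disable broadcast joins so the small-table broadcast path is not taken...
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
// ...and tell the planner not to prefer sort merge join over shuffle hash join.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

// `t1`/`t2` are placeholder tables; inspect the physical plan for ShuffledHashJoin.
val joined = spark.table("t1").join(spark.table("t2"), "key")
joined.explain()
```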

For detailed join strategies, take a look at the source code of SparkStrategies.scala:
/**
 * Select the proper physical plan for join based on joining keys and size of logical plan.
 *
 * At first, uses the [[ExtractEquiJoinKeys]] pattern to find joins where at least some of the
 * predicates can be evaluated by matching join keys. If found, join implementations are chosen
 * with the following precedence:
 *
 * - Broadcast: if one side of the join has an estimated physical size that is smaller than the
 *     user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold
 *     or if that side has an explicit broadcast hint (e.g. the user applied the
 *     [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side
 *     of the join will be broadcasted and the other side will be streamed, with no shuffling
 *     performed. If both sides of the join are eligible to be broadcasted then the
 * - Shuffle hash join: if the average size of a single partition is small enough to build
 *     a hash table.
 * - Sort merge: if the matching join keys are sortable.
 *
 * If there is no joining keys, Join implementations are chosen with the following precedence:
 * - BroadcastNestedLoopJoin: if one side of the join could be broadcasted
 * - CartesianProduct: for Inner join
 * - BroadcastNestedLoopJoin
 */
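The broadcast hint mentioned in that comment can be applied from user code. A minimal sketch (the `large`/`small` DataFrame names are placeholders):

```scala
import org.apache.spark.sql.functions.broadcast

// Hint that `small` should be broadcast regardless of its estimated size;
// `large` is then streamed, so no shuffle is performed for this join.
val joined = large.join(broadcast(small), "key")
joined.explain()  // the physical plan should show a broadcast join
```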


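On the question quoted below about writing a rule to always pick shuffle hash join: one possible entry point is `SparkSession.experimental.extraStrategies`, which lets you prepend a custom planning strategy. The sketch below is untested and names internal classes from the Spark 2.x source (`ExtractEquiJoinKeys`, `ShuffledHashJoinExec`, `BuildRight`); these are not a stable API and may move or change between releases:

```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.joins.{BuildRight, ShuffledHashJoinExec}

// Untested sketch: plan every equi-join as a shuffle hash join,
// always building the hash table on the right side.
object ForceShuffleHashJoin extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right) =>
      ShuffledHashJoinExec(leftKeys, rightKeys, joinType, BuildRight,
        condition, planLater(left), planLater(right)) :: Nil
    case _ => Nil
  }
}

// Register it ahead of the built-in strategies (`spark` is a SparkSession).
spark.experimental.extraStrategies = Seq(ForceShuffleHashJoin)
```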
> On Jul 5, 2016, at 13:28, Lalitha MV <lalithamv92@gmail.com> wrote:
> 
> It picks sort merge join when spark.sql.autoBroadcastJoinThreshold is set to -1, or when the size of the small table is larger than spark.sql.autoBroadcastJoinThreshold.
> 
> On Mon, Jul 4, 2016 at 10:17 PM, Takeshi Yamamuro <linguin.m.s@gmail.com> wrote:
> The join selection is described in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L92.
> If you have join keys, you can set `spark.sql.autoBroadcastJoinThreshold` to -1 to disable broadcast joins. Then, hash joins are used in queries.
> 
> // maropu 
> 
> On Tue, Jul 5, 2016 at 4:23 AM, Lalitha MV <lalithamv92@gmail.com> wrote:
> Hi maropu, 
> 
> Thanks for your reply. 
> 
> Would it be possible to write a rule for this, so that it always picks shuffle hash join over the other join implementations (i.e., sort merge and broadcast)?
> 
> Is there any documentation demonstrating rule-based transformations for physical plan trees?
> 
> Thanks,
> Lalitha
> 
> On Sat, Jul 2, 2016 at 12:58 AM, Takeshi Yamamuro <linguin.m.s@gmail.com> wrote:
> Hi,
> 
> No, Spark has no hint for the hash join.
> 
> // maropu
> 
> On Fri, Jul 1, 2016 at 4:56 PM, Lalitha MV <lalithamv92@gmail.com> wrote:
> Hi, 
> 
> In order to force broadcast hash join, we can set the spark.sql.autoBroadcastJoinThreshold config. Is there a way to enforce shuffle hash join in Spark SQL?
> 
> 
> Thanks,
> Lalitha
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro
> 
> 
> 
> -- 
> Regards,
> Lalitha
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro
> 
> 
> 
> -- 
> Regards,
> Lalitha

