spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From zhangliyun <>
Subject Re:Re: A question about broadcast nest loop join
Date Wed, 23 Oct 2019 21:57:33 GMT

Hi all:

    From google , I know that:

    Spark can only pick BroadcastNestedLoopJoin to implement left/right join. 
    but why I use  following case , broascastnestedLoopJoin became Sortmerged join when set

set spark.sql.autoBroadcastJoinThreshold=-1;
 explain select a.key1, b.key3 from testdata1 as a  left join testdata3  as b where a.value1=b.value3;
== Physical Plan ==
*(5) Project [key1#18, key3#20]
+- *(5) SortMergeJoin [value1#19], [value3#21], Inner
   :- *(2) Sort [value1#19 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(value1#19, 200)
   :     +- *(1) Filter isnotnull(value1#19)
   :        +- HiveTableScan [key1#18, value1#19], HiveTableRelation `default`.`testdata1`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#18, value1#19]
   +- *(4) Sort [value3#21 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(value3#21, 200)
         +- *(3) Filter isnotnull(value3#21)
            +- HiveTableScan [key3#20, value3#21], HiveTableRelation `default`.`testdata3`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#20, value3#21]


Appreciate to know your suggestion

Best Regards

Kelly Zhang

在 2019-10-23 22:12:35,"angers.zhu" <> 写道:

where not in ( query block)
condition will been change to LeftSemi join in optimizer rule RewritePredicateSubquery. 
Then as cloud-fan said,  it will be change to a BroadCastNestLoopJoin
| |

On 10/23/2019 20:55,Wenchen Fan<> wrote:
I haven't looked into your query yet, just want to let you know that: Spark can only pick
BroadcastNestedLoopJoin to implement left/right join. If the table is very big, then OOM happens.

Maybe there is an algorithm to implement left/right join in a distributed environment without
broadcast, but currently Spark is only able to deal with it using broadcast. 

On Wed, Oct 23, 2019 at 6:02 PM zhangliyun <> wrote:

Hi all: 
i want to ask a question about broadcast nestloop join? from google i know, that 
 left outer/semi join and right outer/semi join will use broadcast nestloop.
  and in some cases, when the input data is very small, it is suitable to use. so here
  how to define the input data very small? what parameter decides the threshold?  I just want
to disable it ( i found that   set spark.sql.autoBroadcastJoinThreshold= -1 is no work for
sql:select a.key1  from testdata1 as a where a.key1 not in (select key3 from testdata3) )


explain cost select a.key1  from testdata1 as a where a.key1 not in (select key3 from testdata3);

== Physical Plan ==
*(1) Project [key1#90]
+- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) || isnull((key1#90 =
   :- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
[key1#90, value1#91]
   +- BroadcastExchange IdentityBroadcastMode
      +- HiveTableScan [key3#92], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
[key3#92, value3#93]


  my question is 
  1. why in not in subquery , BroadcastNestedLoopJoin is still used even i set spark.sql.autoBroadcastJoinThreshold=
  2. which spark parameter  decides enable/disable BroadcastNestedLoopJoin.

Appreciate if you have suggestion

Best Regards

Kelly Zhang

View raw message