spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From zhangliyun <kelly...@126.com>
Subject Re:Re: A question about broadcast nest loop join
Date Wed, 23 Oct 2019 22:29:49 GMT


Hi Herman:
   I guess what you mentioned before 
```
if you are OK with slightly different NULL semantics then you could use NOT EXISTS(subquery).
The latter should perform a lot better.


```
is the NULL key1 of  left table will be retained if NULL key2 is not found in the right table
 ( join condition :  left.key1 = right.key2)  in exists semantics while this will not happen
in 
"in semantics". If my understanding wrong, tell me.






Best Regards.


Kelly Zhang
















在 2019-10-23 19:16:34,"Herman van Hovell" <herman@databricks.com> 写道:

In some cases BroadcastNestedLoopJoin is the only viable join method. In your example for
instance you are using a non-equi join condition and BNLJ is the only method that works in
that case. This is also the reason why you can't disable it using the spark.sql.autoBroadcastJoinThreshold
configuration.


Such a plan is generally generated by using a NOT IN (subquery), if you are OK with slightly
different NULL semantics then you could use NOT EXISTS(subquery). The latter should perform
a lot better.


On Wed, Oct 23, 2019 at 12:02 PM zhangliyun <kellyzly@126.com> wrote:

Hi all: 
i want to ask a question about broadcast nestloop join? from google i know, that 
 left outer/semi join and right outer/semi join will use broadcast nestloop.
  and in some cases, when the input data is very small, it is suitable to use. so here
  how to define the input data very small? what parameter decides the threshold?  I just want
to disable it ( i found that   set spark.sql.autoBroadcastJoinThreshold= -1 is no work for
sql:select a.key1  from testdata1 as a where a.key1 not in (select key3 from testdata3) )




```


explain cost select a.key1  from testdata1 as a where a.key1 not in (select key3 from testdata3);


== Physical Plan ==
*(1) Project [key1#90]
+- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) || isnull((key1#90 =
key3#92)))
   :- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
[key1#90, value1#91]
   +- BroadcastExchange IdentityBroadcastMode
      +- HiveTableScan [key3#92], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
[key3#92, value3#93]


```


  my question is 
  1. why in not in subquery , BroadcastNestedLoopJoin is still used even i set spark.sql.autoBroadcastJoinThreshold=
-1 
  2. which spark parameter  decides enable/disable BroadcastNestedLoopJoin.






Appreciate if you have suggestion




Best Regards


Kelly Zhang




 
Mime
View raw message