spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wenchen Fan <cloud0...@gmail.com>
Subject Re: Re: A question about broadcast nest loop join
Date Thu, 24 Oct 2019 05:13:35 GMT
Ah sorry I made a mistake. "Spark can only pick BroadcastNestedLoopJoin to
implement left/right join" this should be "left/right non-equal join"

On Thu, Oct 24, 2019 at 6:32 AM zhangliyun <kellyzly@126.com> wrote:

>
> Hi Herman:
>    I guess what you mentioned before
> ```
> if you are OK with slightly different NULL semantics then you could use NOT
> EXISTS(subquery). The latter should perform a lot better.
>
> ```
> is the NULL key1 of  left table will be retained if NULL key2 is not found
> in the right table  ( join condition :  left.key1 = right.key2)  in exists
> semantics while this will not happen in
> "in semantics". If my understanding wrong, tell me.
>
>
>
> Best Regards.
>
> Kelly Zhang
>
>
>
>
>
>
>
>
> 在 2019-10-23 19:16:34,"Herman van Hovell" <herman@databricks.com> 写道:
>
> In some cases BroadcastNestedLoopJoin is the only viable join method. In
> your example for instance you are using a non-equi join condition and BNLJ
> is the only method that works in that case. This is also the reason why you
> can't disable it using the spark.sql.autoBroadcastJoinThreshold
> configuration.
>
> Such a plan is generally generated by using a NOT IN (subquery), if you
> are OK with slightly different NULL semantics then you could use NOT
> EXISTS(subquery). The latter should perform a lot better.
>
> On Wed, Oct 23, 2019 at 12:02 PM zhangliyun <kellyzly@126.com> wrote:
>
>> Hi all:
>> i want to ask a question about broadcast nestloop join? from google i
>> know, that
>>  left outer/semi join and right outer/semi join will use broadcast
>> nestloop.
>>   and in some cases, when the input data is very small, it is suitable to
>> use. so here
>>   how to define the input data very small? what parameter decides the
>> threshold?  I just want to disable it ( i found that   set
>> spark.sql.autoBroadcastJoinThreshold= -1 is no work for sql:select a.key1
>>  from testdata1 as a where a.key1 not in (select key3 from testdata3) )
>>
>>
>> ```
>>
>> explain cost select a.key1  from testdata1 as a where a.key1 not in
>> (select key3 from testdata3);
>>
>> == Physical Plan ==
>> *(1) Project [key1#90]
>> +- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) ||
>> isnull((key1#90 = key3#92)))
>>    :- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`,
>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#90, value1#91]
>>    +- BroadcastExchange IdentityBroadcastMode
>>       +- HiveTableScan [key3#92], HiveTableRelation
>> `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
>> [key3#92, value3#93]
>>
>> ```
>>
>>   my question is
>>   1. why in not in subquery , BroadcastNestedLoopJoin is still used even
>> i set spark.sql.autoBroadcastJoinThreshold= -1
>>   2. which spark parameter  decides enable/disable
>> BroadcastNestedLoopJoin.
>>
>>
>>
>> Appreciate if you have suggestion
>>
>>
>> Best Regards
>>
>> Kelly Zhang
>>
>>
>>
>>
>
>
>
>

Mime
View raw message