spark-dev mailing list archives

From 吴晓菊 <chrysan...@gmail.com>
Subject Re: why BroadcastHashJoinExec is not implemented with outputOrdering?
Date Fri, 29 Jun 2018 02:43:10 GMT
And it should be generic for HashJoin, not only for broadcast join, right?


Chrysan Wu
吴晓菊
Phone:+86 17717640807


2018-06-29 10:42 GMT+08:00 吴晓菊 <chrysanxia@gmail.com>:

> Sorry for the mistake. You are right: the output ordering of a broadcast
> join can be the ordering of the big table for some join types. I will
> prepare a PR for you to review later. Thanks a lot!
>
>
> Chrysan Wu
> 吴晓菊
> Phone:+86 17717640807
>
>
> 2018-06-29 0:00 GMT+08:00 Wenchen Fan <cloud0fan@gmail.com>:
>
>> SortMergeJoin sorts its children by join key, but broadcast join does
>> not. I think the output ordering of broadcast join has nothing to do with
>> join key.
>>
>> On Thu, Jun 28, 2018 at 11:28 PM Marco Gaido <marcogaido91@gmail.com>
>> wrote:
>>
>>> I think the outputOrdering would be that of the big table (if any),
>>> and it wouldn't matter whether this involves the join keys or not. Am I wrong?
>>>
>>> 2018-06-28 17:01 GMT+02:00 吴晓菊 <chrysanxia@gmail.com>:
>>>
>>>> Thanks for the reply.
>>>> Looking at SortMergeJoinExec, I think we can follow what SortMergeJoin
>>>> does: for some join types, if the children are ordered on the join
>>>> keys, we can report the ordered join keys as the output ordering.
>>>>
>>>>
>>>> Chrysan Wu
>>>> 吴晓菊
>>>> Phone:+86 17717640807
>>>>
>>>>
>>>> 2018-06-28 22:53 GMT+08:00 Wenchen Fan <cloud0fan@gmail.com>:
>>>>
>>>>> SortMergeJoin only reports ordering of the join keys, not the output
>>>>> ordering of any child.
>>>>>
>>>>> It seems reasonable to me that broadcast join should respect the
>>>>> output ordering of the children. Feel free to submit a PR to fix it,
>>>>> thanks!
>>>>>
>>>>> On Thu, Jun 28, 2018 at 10:07 PM 吴晓菊 <chrysanxia@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Why can't we use the output ordering of the big table?
>>>>>>
>>>>>>
>>>>>> Chrysan Wu
>>>>>> Phone:+86 17717640807
>>>>>>
>>>>>>
>>>>>> 2018-06-28 21:48 GMT+08:00 Marco Gaido <marcogaido91@gmail.com>:
>>>>>>
>>>>>>> The easy answer is that SortMergeJoin ensures an outputOrdering,
>>>>>>> while BroadcastHashJoin doesn't, i.e. after running a
>>>>>>> BroadcastHashJoin you don't know what the order of the output is
>>>>>>> going to be, since nothing enforces it.
>>>>>>>
>>>>>>> Hope this helps.
>>>>>>> Thanks.
>>>>>>> Marco
>>>>>>>
>>>>>>> 2018-06-28 15:46 GMT+02:00 吴晓菊 <chrysanxia@gmail.com>:
>>>>>>>
>>>>>>>>
>>>>>>>> We see that SortMergeJoinExec implements both outputPartitioning
>>>>>>>> and outputOrdering, while BroadcastHashJoinExec implements only
>>>>>>>> outputPartitioning. Why is it designed this way?
>>>>>>>>
>>>>>>>> Chrysan Wu
>>>>>>>> Phone:+86 17717640807
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>
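The point the thread converges on, that a hash join can keep the ordering of the streamed (big) table regardless of the join keys, can be illustrated with a small simulation. This is only a sketch in plain Python, not Spark code: `hash_join`, `big`, `small`, and the column names are hypothetical, and it models a single partition under the assumption that the join streams one side once while probing a hash table built from the other side.

```python
from collections import defaultdict


def hash_join(streamed, build, key, how="inner"):
    """Join `streamed` rows against a hash table built from `build`.

    Rows are dicts. `how` is "inner" or "left_outer", with the streamed
    side acting as the left side. Hypothetical helper, not a Spark API.
    """
    # Build phase: index the small side by join key.
    table = defaultdict(list)
    for row in build:
        table[row[key]].append(row)

    # Probe phase: a single pass over the streamed side, in its own order.
    out = []
    for s in streamed:
        matches = table.get(s[key], [])
        if matches:
            for b in matches:
                out.append({**b, **s})
        elif how == "left_outer":
            out.append(dict(s))  # unmatched streamed row is kept as-is
    return out


# The big side is sorted by `ts`, which is not the join key `k`.
big = [{"k": 3, "ts": 1}, {"k": 1, "ts": 2}, {"k": 3, "ts": 5}, {"k": 2, "ts": 9}]
small = [{"k": 1, "name": "a"}, {"k": 3, "name": "c"}, {"k": 3, "name": "d"}]

joined = hash_join(big, small, "k")
ts = [r["ts"] for r in joined]  # [1, 1, 2, 5, 5]: still sorted by ts
ks = [r["k"] for r in joined]   # [3, 3, 1, 3, 3]: not sorted by join key
```

In this model the output stays sorted on `ts` for both the inner and the left outer variant, because each output row is emitted while walking the streamed side in order; the join key plays no role in the resulting ordering. Which join types and build sides preserve ordering in an actual Spark operator is a separate question that the sketch does not answer.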
