spark-issues mailing list archives

From "David Vrba (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-25401) Reorder the required ordering to match the table's output ordering for bucket join
Date Fri, 07 Dec 2018 05:19:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-25401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712350#comment-16712350 ]

David Vrba commented on SPARK-25401:
------------------------------------

I was looking at this, and I believe that in the class EnsureRequirements we could reorder the
join predicates for SortMergeJoin once more, just before we check whether the child's outputOrdering
satisfies the requiredOrdering, so that the predicate keys align with the child's outputOrdering.
In that case no unnecessary SortExec would be added, and no unnecessary Exchange either,
because the Exchange is handled before that point.
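
To make the idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark code: the names `ordering_satisfies` and `reorder_join_keys` are hypothetical, columns are plain strings, and sort direction is ignored (everything is assumed ascending). It only illustrates reordering the join key pairs to follow the children's output ordering before the ordering check runs.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for one equi-join predicate: (left_column, right_column).
JoinKey = Tuple[str, str]


def ordering_satisfies(output: List[str], required: List[str]) -> bool:
    """Toy prefix check: `required` must be a prefix of `output`."""
    return output[:len(required)] == required


def reorder_join_keys(
    keys: List[JoinKey],
    left_ordering: List[str],
    right_ordering: List[str],
) -> Optional[List[JoinKey]]:
    """Permute `keys` so the left columns follow `left_ordering`.

    Returns None when no permutation lines up with BOTH children, in which
    case the caller would keep the original keys (and the sort).
    """
    n = len(keys)
    by_left = {left: (left, right) for left, right in keys}
    if len(by_left) != n or len(left_ordering) < n or len(right_ordering) < n:
        return None
    candidate = [by_left.get(col) for col in left_ordering[:n]]
    if None in candidate or len(set(candidate)) != n:
        return None
    # The same permutation must also match the right child's ordering.
    if [right for _, right in candidate] != right_ordering[:n]:
        return None
    return candidate


# The case from the issue: join on (a1=b1, a2=b2), tables sorted by (a2, a1) and (b2, b1).
keys = [("a1", "b1"), ("a2", "b2")]
left_out, right_out = ["a2", "a1"], ["b2", "b1"]

needs_sort_before = not ordering_satisfies(left_out, [l for l, _ in keys])
reordered = reorder_join_keys(keys, left_out, right_out)
needs_sort_after = not ordering_satisfies(left_out, [l for l, _ in reordered])
```

With the original key order the toy check reports that a sort is needed; after reordering, the required ordering matches the output ordering on both sides and no sort would be added. Whether the real change belongs exactly at this point in EnsureRequirements is the open question above.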

 

What do you guys think? Is it a good approach? (Please be patient with me, this is my first
Jira on Spark)

> Reorder the required ordering to match the table's output ordering for bucket join
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-25401
>                 URL: https://issues.apache.org/jira/browse/SPARK-25401
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Wang, Gang
>            Priority: Major
>
> Currently, we check whether a SortExec is needed between an operator and its child operator
> in the method orderingSatisfies, which requires the SortOrders to be in exactly the same order.
> However, consider the following case:
>  * Table a is bucketed by (a1, a2), sorted by (a2, a1), and has 200 buckets.
>  * Table b is bucketed by (b1, b2), sorted by (b2, b1), and has 200 buckets.
>  * Table a is joined with table b on (a1=b1, a2=b2).
> In this case, if the join is a sort merge join, the query planner won't add an Exchange on
> either side, but it will add a sort on both sides. The sort is also unnecessary, since
> within matching buckets, such as bucket 1 of table a and bucket 1 of table b, (a1=b1, a2=b2) is equivalent
> to (a2=b2, a1=b1).
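
The equivalence claimed in the quoted description can be checked on a toy example. The sketch below is plain Python, not Spark code: rows are hypothetical `(x1, x2)` tuples standing in for one bucket of each table, and `merge_join` is a simplified sort-merge join written for illustration. It merges two buckets that are sorted by (x2, x1) using the key order (x2, x1), and compares the result with a brute-force join on x1 and x2.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, int]  # hypothetical row layout: (x1, x2)


def merge_join(
    left: List[Row], right: List[Row], key: Callable[[Row], Tuple[int, int]]
) -> List[Tuple[Row, Row]]:
    """Simplified sort-merge join; both inputs must already be sorted by `key`."""
    out: List[Tuple[Row, Row]] = []
    i = j = 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Find the run of equal keys on each side and emit their cross product.
            i_end = i
            while i_end < len(left) and key(left[i_end]) == kl:
                i_end += 1
            j_end = j
            while j_end < len(right) and key(right[j_end]) == kl:
                j_end += 1
            out.extend((l, r) for l in left[i:i_end] for r in right[j:j_end])
            i, j = i_end, j_end
    return out


def brute_force_join(left: List[Row], right: List[Row]) -> List[Tuple[Row, Row]]:
    """Reference result for the predicate (x1 = x1' AND x2 = x2')."""
    return [(l, r) for l in left for r in right if l[0] == r[0] and l[1] == r[1]]


def by_x2_x1(row: Row) -> Tuple[int, int]:
    """Sort key matching the tables' declared sort order (x2, x1)."""
    return (row[1], row[0])


# One bucket from each side, stored sorted by (x2, x1) as in the issue.
bucket_a = sorted([(1, 2), (2, 1), (1, 1), (3, 2)], key=by_x2_x1)
bucket_b = sorted([(1, 1), (3, 2), (2, 2), (1, 2), (1, 2)], key=by_x2_x1)

merged = merge_join(bucket_a, bucket_b, by_x2_x1)
expected = brute_force_join(bucket_a, bucket_b)
```

Because the join condition is a conjunction of equalities, merging on (x2, x1) finds the same matching pairs as any other key order, so no extra sort is needed for correctness in this toy model.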



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


