spark-issues mailing list archives

From "Xiao Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-12602) Join Reordering: Pushing Inner Join Through Outer Join
Date Tue, 12 Jan 2016 02:14:40 GMT

    [ https://issues.apache.org/jira/browse/SPARK-12602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15093158#comment-15093158 ]

Xiao Li commented on SPARK-12602:
---------------------------------

Sure, thanks!

> Join Reordering: Pushing Inner Join Through Outer Join
> ------------------------------------------------------
>
>                 Key: SPARK-12602
>                 URL: https://issues.apache.org/jira/browse/SPARK-12602
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer, SQL
>    Affects Versions: 1.6.0
>            Reporter: Xiao Li
>            Priority: Critical
>
> If applicable, we can push an inner join through an outer join. The basic idea builds on the associativity property of outer and inner joins:
> {code}
> R1 inner (R2 left R3 on p23) on p12 = (R1 inner R2 on p12) left R3 on p23
> R1 inner (R2 right R3 on p23) on p13 = R2 right (R1 inner R3 on p13) on p23 = (R1 inner R3 on p13) left R2 on p23
> (R1 left R2 on p12) inner R3 on p13 = (R1 inner R3 on p13) left R2 on p12
> (R1 right R2 on p12) inner R3 on p23 = R1 right (R2 inner R3 on p23) on p12 = (R2 inner R3 on p23) left R1 on p12
> {code}
> Reordering can reduce the number of rows processed, because an inner join never produces more rows than a left or right outer join over the same inputs and predicate. This change can improve query performance in most cases.
> Once cost-based optimization is available, we can also switch the order of tables within each join type based on their costs: the order of the joined tables in an inner join does not affect the result, and a right outer join can be rewritten as a left outer join. That part is out of scope here.
> For example, given the following eligible query:
> {code}df.join(df2, $"a.int" === $"b.int", "right").join(df3, $"c.int" === $"b.int", "inner"){code}
> Before the fix, the logical plan looks like this:
> {code}
> Join Inner, Some((int#15 = int#9))
> :- Join RightOuter, Some((int#3 = int#9))
> :  :- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
> :  +- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
> +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
> {code}
> After the fix, the logical plan looks like this:
> {code}
> Join LeftOuter, Some((int#3 = int#9))
> :- Join Inner, Some((int#15 = int#9))
> :  :- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
> :  +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
> +- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
> {code}
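
The equivalence behind the two plans above can be checked by hand on the same three relations. The following is only a minimal sketch using plain Scala collections, not Spark and not the optimizer rule itself. The object name `JoinReorderSketch`, the helper functions, and the relation names `a`, `b`, `c` (taken from the column prefixes in the example query) are assumptions made for illustration.

```scala
// Plain-collections model of the example query; no Spark dependency.
object JoinReorderSketch {
  // One row of a relation: (int, int2, str), matching the LocalRelation columns.
  type Row = (Int, Int, Int)
  // A joined row keyed by relation name; None marks the null-padded side of an outer join.
  type Joined = Map[String, Option[Row]]

  val r1: Seq[Row] = Seq((1, 2, 1), (3, 4, 3)) // int#3,  int2#4,  str#5   (alias a)
  val r2: Seq[Row] = Seq((1, 3, 1), (5, 6, 5)) // int#9,  int2#10, str#11  (alias b)
  val r3: Seq[Row] = Seq((1, 9, 8), (5, 0, 4)) // int#15, int2#16, str#17  (alias c)

  def scan(name: String, rows: Seq[Row]): Seq[Joined] =
    rows.map(r => Map(name -> Option(r)))

  // Equality on the `int` column of two named relations; a null side never matches.
  def intEq(x: String, y: String)(row: Joined): Boolean =
    (for { l <- row.get(x).flatten; r <- row.get(y).flatten } yield l._1 == r._1)
      .getOrElse(false)

  def innerJoin(left: Seq[Joined], right: Seq[Joined])(p: Joined => Boolean): Seq[Joined] =
    for { l <- left; r <- right; if p(l ++ r) } yield l ++ r

  // Left outer join: unmatched left rows are kept and padded with None for `rightNames`.
  def leftJoin(left: Seq[Joined], right: Seq[Joined], rightNames: Set[String])(
      p: Joined => Boolean): Seq[Joined] =
    left.flatMap { l =>
      val matches = right.map(r => l ++ r).filter(p)
      if (matches.nonEmpty) matches
      else Seq(l ++ rightNames.map(n => n -> Option.empty[Row]))
    }

  // Right outer join expressed as a left outer join with the sides swapped.
  def rightJoin(left: Seq[Joined], right: Seq[Joined], leftNames: Set[String])(
      p: Joined => Boolean): Seq[Joined] =
    leftJoin(right, left, leftNames)(p)

  val (a, b, c) = (scan("a", r1), scan("b", r2), scan("c", r3))

  // Before: (a right b on a.int = b.int) inner c on c.int = b.int
  val before: Seq[Joined] =
    innerJoin(rightJoin(a, b, Set("a"))(intEq("a", "b")), c)(intEq("c", "b"))

  // After: (b inner c on c.int = b.int) left a on a.int = b.int
  val after: Seq[Joined] =
    leftJoin(innerJoin(b, c)(intEq("c", "b")), a, Set("a"))(intEq("a", "b"))

  def main(args: Array[String]): Unit = {
    // Compare as sets: row order and column order are not significant here.
    println(before.toSet == after.toSet) // prints true
  }
}
```

On this data both plans return the same two rows, one of them with the `a` side null-padded. The check only confirms the rewrite on this tiny input; it says nothing about performance.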



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


