spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ruslan Dautkhanov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
Date Fri, 28 Jul 2017 15:44:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105127#comment-16105127
] 

Ruslan Dautkhanov commented on SPARK-21274:
-------------------------------------------

[~viirya], yes it returns {noformat}[1, 2, 2]{noformat} for both of the queries. 

I don't think [1, 2] is the correct behavior for the first query.
EXCEPT ALL which returns all records from the *first* table which are not present in the second
table, leaving the duplicates as is.



> Implement EXCEPT ALL and INTERSECT ALL
> --------------------------------------
>
>                 Key: SPARK-21274
>                 URL: https://issues.apache.org/jira/browse/SPARK-21274
>             Project: Spark
>          Issue Type: New Feature
>          Components: Optimizer, SQL
>    Affects Versions: 2.0.0, 2.1.0, 2.2.0
>            Reporter: Ruslan Dautkhanov
>              Labels: set, sql
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following outer join:
> {code}
> SELECT a,b,c
> FROM    tab1 t1
>      LEFT OUTER JOIN 
>         tab2 t2
>      ON (
>         (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
>      )
> WHERE
>     COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register as a temp.view this second query under "*t1_except_t2_df*" name that can be
also used to find INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following anti-join using t1_except_t2_df we defined above:
> {code}
> SELECT a,b,c
> FROM    tab1 t1
> WHERE 
>    NOT EXISTS
>    (    SELECT 1
>         FROM    t1_except_t2_df e
>         WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
>    )
> {code}
> So the suggestion is just to use above query rewrites to implement both EXCEPT ALL and
INTERSECT ALL sql set operations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message