spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ruslan Dautkhanov (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
Date Fri, 28 Jul 2017 15:47:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105127#comment-16105127
] 

Ruslan Dautkhanov edited comment on SPARK-21274 at 7/28/17 3:46 PM:
--------------------------------------------------------------------

[~viirya], yes it returns {noformat}[1, 2, 2]{noformat} for both of the queries. 

I don't think {noformat}[1, 2]{noformat} is the correct behavior for the first query.

EXCEPT ALL which returns all records from the *first* table which are not present in the second
table, leaving the duplicates as is.

If you believe it should be "1,2", then it's easy to fix by just changing tab1 to tab2 in
the second query.

Or other way around, original queries would return 
{noformat}
[1, 2]
for [1, 2] intersect_all [1, 2, 2]
{noformat}


was (Author: tagar):
[~viirya], yes it returns {noformat}[1, 2, 2]{noformat} for both of the queries. 

I don't think [1, 2] is the correct behavior for the first query.
EXCEPT ALL which returns all records from the *first* table which are not present in the second
table, leaving the duplicates as is.



> Implement EXCEPT ALL and INTERSECT ALL
> --------------------------------------
>
>                 Key: SPARK-21274
>                 URL: https://issues.apache.org/jira/browse/SPARK-21274
>             Project: Spark
>          Issue Type: New Feature
>          Components: Optimizer, SQL
>    Affects Versions: 2.0.0, 2.1.0, 2.2.0
>            Reporter: Ruslan Dautkhanov
>              Labels: set, sql
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following outer join:
> {code}
> SELECT a,b,c
> FROM    tab1 t1
>      LEFT OUTER JOIN 
>         tab2 t2
>      ON (
>         (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
>      )
> WHERE
>     COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register as a temp.view this second query under "*t1_except_t2_df*" name that can be
also used to find INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following anti-join using t1_except_t2_df we defined above:
> {code}
> SELECT a,b,c
> FROM    tab1 t1
> WHERE 
>    NOT EXISTS
>    (    SELECT 1
>         FROM    t1_except_t2_df e
>         WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
>    )
> {code}
> So the suggestion is just to use above query rewrites to implement both EXCEPT ALL and
INTERSECT ALL sql set operations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message