spark-issues mailing list archives

From "Nicholas Chammas (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input
Date Mon, 09 Apr 2018 22:26:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-23945:
-------------------------------------
    Description: 
In SQL you can filter rows based on the result of a subquery:
{code:sql}
SELECT *
FROM table1
WHERE name NOT IN (
    SELECT name
    FROM table2
);{code}
In the Spark DataFrame API, the equivalent would probably look like this:
{code:python}
(table1
    .where(
        ~col('name').isin(
            table2.select('name')
        )
    )
){code}
However, .isin() currently [only accepts a local list of values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].
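
For context, here is a minimal sketch of what works today, and the collect-based workaround it forces, using the same illustrative table1/table2 DataFrames as above (assumed to already exist):
{code:python}
from pyspark.sql.functions import col

# Works today: .isin() with a local list of literal values.
table1.where(~col('name').isin(['alice', 'bob']))

# The workaround this issue would like to avoid: collecting the
# subquery's values to the driver just to build that local list.
local_names = [row['name'] for row in table2.select('name').collect()]
table1.where(~col('name').isin(local_names))
{code}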

I imagine making this enhancement would happen as part of a larger effort to support correlated subqueries in the DataFrame API.

Or perhaps there is no plan to support this style of query in the DataFrame API, and queries like this should instead be written a different way. If so, how would we write a query like the one above in the DataFrame API without needing to collect the values for the NOT IN filter locally?
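
One way to express this today without collecting values locally appears to be a left anti join, though it is not a drop-in replacement for .isin() (again a sketch using the illustrative table1/table2 from above):
{code:python}
# NOT IN expressed as a left anti join: keep the rows of table1 whose
# name has no match in table2. Note that the null-handling semantics
# differ from SQL's NOT IN when table2.name contains nulls.
result = table1.join(table2.select('name'), on='name', how='left_anti')
{code}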

 

> Column.isin() should accept a single-column DataFrame as input
> --------------------------------------------------------------
>
>                 Key: SPARK-23945
>                 URL: https://issues.apache.org/jira/browse/SPARK-23945
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

