Nick, thanks for raising this.

It looks useful to have something in the DF API that behaves like sub-queries, but I’m not sure that passing a DF works. Making every method accept a DF that may contain matching data seems like it puts a lot of work on the API — which now has to accept a DF all over the place.

What about exposing transforms that make it easy to coerce data to what the method needs? Instead of passing a dataframe, you’d pass df.toSet to isin:

val subQ = spark.sql("select distinct filter_col from source")
val df = table.filter($"col".isin(subQ.toSet))

That also distinguishes between a sub-query and a correlated sub-query that uses values from the outer query. We would still need to come up with syntax for the correlated case, unless there’s a proposal already.


On Mon, Apr 9, 2018 at 3:56 PM, Nicholas Chammas <> wrote:
I just submitted SPARK-23945 but wanted to double check here to make sure I didn't miss something fundamental.

Correlated subqueries are tracked at a high level in SPARK-18455, but it's not clear to me whether they are "design-appropriate" for the DataFrame API.

Are correlated subqueries a thing we can expect to have in the DataFrame API?


Ryan Blue
Software Engineer