Nick, thanks for raising this.
It looks useful to have something in the DF API that behaves like sub-queries, but I'm not sure that passing a DF works. Making every method accept a DF that may contain matching data puts a lot of burden on the API, which would now have to accept a DF all over the place.
What about exposing transforms that make it easy to coerce data to what the method needs? Instead of passing a DataFrame, you'd pass something like this:
    val subQ = spark.sql("select distinct filter_col from source")
    val df = table.filter($"col".isin(subQ.toSet))
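(To be clear, toSet isn't in the DataFrame API today; it's the kind of transform I'm suggesting we expose. If you want the same effect right now, a rough sketch with the existing API is to collect the distinct values for isin, or to stay distributed with a left semi join. Table names below are placeholders.)

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Placeholders standing in for the example above.
    val table = spark.table("events")
    val subQ = spark.sql("select distinct filter_col from source")

    // Closest thing to isin(subQ.toSet) today: collect the distinct
    // values to the driver and pass them to isin (fine for small sets).
    val values = subQ.collect().map(_.get(0))
    val viaIsin = table.filter($"col".isin(values: _*))

    // Or keep everything on the cluster with a left semi join, which is
    // roughly what an uncorrelated IN sub-query gets planned as anyway.
    val viaSemiJoin = table.join(subQ, $"col" === $"filter_col", "left_semi")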
Exposing a transform like that also distinguishes between a sub-query and a correlated sub-query that uses values from the outer query. We would still need to come up with syntax for the correlated case, unless there's already a proposal.
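(For context, the correlated case is where the inner query references columns of the outer row, so there is no single set of values to pre-compute. In SQL it looks something like this, with made-up table and column names:)

    // Correlated sub-query: the inner query refers to o.customer_id from
    // the outer row, so it can't be reduced to a fixed set of values.
    val correlated = spark.sql("""
      SELECT o.id, o.total
      FROM orders o
      WHERE o.total > (
        SELECT avg(i.total) FROM orders i WHERE i.customer_id = o.customer_id
      )
    """)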
I just submitted SPARK-23945 but wanted to double check here to make sure I didn't miss something fundamental.

Correlated subqueries are tracked at a high level in SPARK-18455, but it's not clear to me whether they are "design-appropriate" for the DataFrame API.

Are correlated subqueries a thing we can expect to have in the DataFrame API?

Nick