spark-dev mailing list archives

From Jacek Laskowski <ja...@japila.pl>
Subject Re: [SQL] Understanding RewriteCorrelatedScalarSubquery optimization (and TreeNode.transform)
Date Mon, 28 May 2018 20:22:06 GMT
Thanks Herman! You've helped me a lot (and I'm going to use your fine
explanation in my gitbook, quoting it when needed! :))

p.s. I also found this today, which helped too -->
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2728434780191932/1483312212640900/6987336228780374/latest.html
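
To convince myself the rewrite preserves semantics, I also mocked up your
example with plain Scala collections (a toy model with made-up values, not
Spark code):

```scala
// Toy stand-ins for tbl_a and tbl_b (made-up rows; not Spark code).
case class A(aId: Int)
case class B(bId: Int, value: Int)

val tblA = Seq(A(1), A(2), A(3))
val tblB = Seq(B(1, 10), B(1, 20), B(2, 5))

// The correlated scalar subquery, evaluated row by row:
// for every a, MAX(value) over the matching tbl_b rows (None if no match).
val viaSubquery: Seq[(Int, Option[Int])] = tblA.map { a =>
  val vs = tblB.filter(_.bId == a.aId).map(_.value)
  (a.aId, vs.reduceOption(_ max _))
}

// The rewritten form: pre-aggregate tbl_b by b_id, then left outer join.
val agg: Map[Int, Int] =
  tblB.groupBy(_.bId).map { case (k, vs) => k -> vs.map(_.value).max }
val viaJoin: Seq[(Int, Option[Int])] = tblA.map(a => (a.aId, agg.get(a.aId)))

// Both plans give (1,Some(20)), (2,Some(5)), (3,None).
assert(viaSubquery == viaJoin)
```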

What about tests? Don't you think the method should have tests? I'm writing
small code snippets anyway while exploring it and have been wondering how
to contribute them back to the Spark project, given that the method is
private. It looks like I should instead focus on the methods of Expression
or even QueryPlan (as those are what triggered my question).
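
For the record, here's one of those snippets -- a toy expression tree
(plain Scala, nothing from Catalyst) that mimics the extract-into-the-
buffer-and-substitute step you described:

```scala
import scala.collection.mutable.ArrayBuffer

// A made-up miniature expression tree (not Catalyst's Expression).
sealed trait Expr
case class Attr(name: String) extends Expr
case class ScalarSubquery(output: Attr) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// Mimics extractCorrelatedScalarSubqueries: collect every subquery into
// the buffer and substitute its output attribute (the s.plan.output.head
// bit), returning the rewritten expression.
def extract(e: Expr, subqueries: ArrayBuffer[ScalarSubquery]): Expr = e match {
  case s: ScalarSubquery => subqueries += s; s.output
  case Add(l, r)         => Add(extract(l, subqueries), extract(r, subqueries))
  case leaf              => leaf
}

val buf = ArrayBuffer.empty[ScalarSubquery]
val rewritten =
  extract(Add(Attr("a_id"), ScalarSubquery(Attr("max_value"))), buf)
// rewritten == Add(Attr("a_id"), Attr("max_value")), and buf now holds the
// subquery, which the rule later plans as a left outer join below the
// operator.
```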

Thanks.

Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, May 28, 2018 at 11:33 AM, Herman van Hövell tot Westerflier <
herman@databricks.com> wrote:

> Hi Jacek,
>
> RewriteCorrelatedScalarSubquery rewrites a plan containing a scalar
> subquery into a left join and a projection/filter/aggregate.
>
> For example:
> SELECT a_id,
>        (SELECT MAX(value)
>         FROM tbl_b
>         WHERE tbl_b.b_id = tbl_a.a_id) AS max_value
> FROM  tbl_a
>
> Will be rewritten into something like this:
> SELECT a_id,
>        agg.max_value
> FROM  tbl_a
>       LEFT OUTER JOIN (SELECT   b_id,
>                                 MAX(value) AS max_value
>                        FROM     tbl_b
>                        GROUP BY b_id) agg
>         ON agg.b_id = tbl_a.a_id
>
> The particular function you refer to extracts all correlated scalar
> subqueries by adding them to the subqueries buffer and rewrites each
> expression to use the output of its subquery (the s.plan.output.head bit).
> The returned expression then replaces the original expression in the
> operator (Aggregate, Project or Filter) it came from, completing the
> rewrite for that operator. The extracted subqueries are planned as LEFT
> OUTER JOINs below the operator.
>
> I hope this makes sense.
>
> Herman
>
>
>
>
>
> On Sun, May 27, 2018 at 9:43 PM Jacek Laskowski <jacek@japila.pl> wrote:
>
>> Hi,
>>
>> I'm trying to understand RewriteCorrelatedScalarSubquery optimization
>> and how extractCorrelatedScalarSubqueries [1] works. I don't understand
>> how "The expression is rewritten and returned." is done. How is the
>> expression rewritten?
>>
>> Since it's private, it's not even possible to write tests, which got me
>> thinking: how do you go about code like this? How do you know whether it
>> works correctly? I'd appreciate any help.
>>
>> [1] https://github.com/apache/spark/blob/branch-2.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala?utf8=%E2%9C%93#L290-L299
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> ----
>> https://about.me/JacekLaskowski
>> Mastering Spark SQL https://bit.ly/mastering-spark-sql
>> Spark Structured Streaming https://bit.ly/spark-structured-streaming
>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
>> Follow me at https://twitter.com/jaceklaskowski
>>
>
