spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marco Gaido <marcogaid...@gmail.com>
Subject Re: Transform plan with scope
Date Tue, 24 Apr 2018 21:04:19 GMT
Hi Joseph, Herman,

thanks for your answers. The specific rule I was looking at
is FoldablePropagation. If you look at it, what is done is that first a
AttributeMap containing all the possible foldable alias is collected, then
they are replace in the whole plan (it is a bit more complex than this, I
know, this is just a simplification). So in this scenario, if we have two
aliases in completely separated trees we are nonetheless replacing them,
since we have no idea of which is the scope where each of them is available
in.

I know that this specific problem can be solved in a much easier way: we
are facing a weird situation where there are two aliases with the same
exprId and this is not a very good situation (we can just fix it in the
analyzer, as you proposed Herman). But logically, it would make more sense
that a replacement like this is enforced only where an attribute is in
scope (in other places it should not occur) and I think that if we perform
the things as they are logically needed we are less likely to introduce in
weird bugs caused by situations which we never thought as possible (or
which they are not, but they may become). So I was thinking that if such an
operation could be useful also in other places, then probably we could
introduce it: that is the reason of this email thread, understanding if we
need it or if my idea is worthless (in this case, I apologize for wasting
your time).

Yes, Herman, the management of the state is the hardest point. My best idea
so far (but I am not satisfied with it) is to enforce that the state which
is passed from the each child to the parent extends both Growable and
Shrinkable and we pass two functions which for each node return
respectively the items to discard form the state and to add to it. But in
the case we think/decide that such an operation I proposed might be useful,
probably we can spend more time on investigating the best solution (any
suggestion in case would be very welcomed).

Any more thoughts on this?

Thanks for your answers and your time,
Marco

2018-04-24 19:47 GMT+02:00 Herman van Hövell tot Westerflier <
herman@databricks.com>:

> Hi Marco,
>
> In the case of SPARK-24015 we should perhaps fix this in the analyzer. It
> seems that the plan is somewhat invalid.
>
> If you do want to fix it in the optimizer we might be able to fix it using
> an approach for similar to RemoveRedundantAliases (which manually recurses
> down the tree).
>
> As for your proposal we could explore this a little bit. My main question
> would be how would you pass information up the tree (from child to parent),
> and how would you merge such information if there are multiple children? It
> might be kind of painful to generalize.
>
> - Herman
>
> On Tue, Apr 24, 2018 at 7:37 PM, Joseph Torres <
> joseph.torres@databricks.com> wrote:
>
>> Is there some transformation we'd want to apply to that tree, but can't
>> because we have no concept of scope? It's already possible for a plan rule
>> to traverse each node's subtree if it wants.
>>
>> On Tue, Apr 24, 2018 at 10:18 AM, Marco Gaido <marcogaido91@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> working on SPARK-24051 I realized that currently in the Optimizer and in
>>> all the places where we are transforming a query plan, we are lacking the
>>> context information of what is in scope and what is not.
>>>
>>> Coming back to the ticket, the bug reported in the ticket is caused
>>> mainly by two reasons:
>>>  1 - we have two aliases in different places of the plan;
>>>  2 - (the focus of this email) we apply all the rules globally over the
>>> whole plan, without any notion of scope where something is
>>> reachable/visible or not.
>>>
>>> I will start with an easy example to explain what I mean. If we have a
>>> simple query like:
>>>
>>> select a, b from (
>>>   select 1 as a, 2 as b from table1
>>>     union
>>>   select 3 as a, 4 as b from table2) q
>>>
>>> We produce a tree which is logically something like:
>>>
>>> Project0(a, b)
>>> -   Union
>>> --    Project1 (a, b)
>>> ---     ScanTable1
>>> --    Project 2(a, b)
>>> ---     ScanTable2
>>>
>>> So when we apply a transformation on Project1 for instance, we have no
>>> information about what is coming from ScanTable1 (or in general any node
>>> which is part of the subtree whose root is Project1): we miss a stateful
>>> transform which allows the children to tell the parent, grandparents, and
>>> so on what is in their scope. This is in particular true for the
>>> Attributes: in a node we have no idea if an Attribute comes from its
>>> subtree (it is in scope) or not.
>>>
>>> So, the point of this email is: do you think in general might be useful
>>> to introduce a way of navigating the tree which allows the children to keep
>>> a state to be used by their parents? Or do you think it is useful in
>>> general to introduce the concept of scope (if an attribute can be accessed
>>> by a node of a plan)?
>>>
>>> Thanks,
>>> Marco
>>>
>>>
>>>
>>
>

Mime
View raw message