spark-dev mailing list archives

From Maryann Xue <maryann....@databricks.com>
Subject Re: [DISCUSS] Out of order optimizer rules?
Date Wed, 02 Oct 2019 20:52:15 GMT
There is no internal write-up, but I think we should at least add an
up-to-date description to that JIRA entry.

On Wed, Oct 2, 2019 at 3:13 PM Reynold Xin <rxin@databricks.com> wrote:

> No, there is no separate write-up internally.
>
> On Wed, Oct 2, 2019 at 12:29 PM Ryan Blue <rblue@netflix.com> wrote:
>
>> Thanks for the pointers, but what I'm looking for is information about
>> the design of this implementation, like what requires this to be in
>> spark-sql instead of spark-catalyst.
>>
>> Even a high-level description, like what the optimizer rules are and what
>> they do would be great. Was there one written up internally that you could
>> share?
>>
>> On Wed, Oct 2, 2019 at 10:40 AM Maryann Xue <maryann.xue@databricks.com>
>> wrote:
>>
>>> > It lists 3 cases for how a filter is built, but nothing about the
>>> overall approach or design that helps when trying to find out where it
>>> should be placed in the optimizer rules.
>>>
>>> The overall idea/design of DPP can be simply put as using the result of
>>> one side of the join to prune partitions of a scan on the other side. The
>>> optimal situation is when the join is a broadcast join and the table being
>>> partition-pruned is on the probe side. In that case, by the time the probe
>>> side starts, the filter will already have the results available and ready
>>> for reuse.
>>>
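The pruning idea described above can be illustrated with a toy model. The sketch below is plain Python, not Spark code: the tables, the `date_id` partition column, and the helper names `pruned_scan` and `join_with_dpp` are all hypothetical. It assumes a broadcast-style join in which the small, filtered side is evaluated first and its join keys are then reused to skip partitions on the probe side.

```python
# Toy model of dynamic partition pruning (DPP). This is NOT Spark code;
# every table, column, and function name here is hypothetical.

# Probe side: a fact table stored as partitions keyed by date_id.
fact_partitions = {
    1: [{"date_id": 1, "amount": 10}, {"date_id": 1, "amount": 20}],
    2: [{"date_id": 2, "amount": 30}],
    3: [{"date_id": 3, "amount": 40}],
}

# Build (broadcast) side: a small dimension table.
dim_dates = [
    {"date_id": 1, "year": 2018},
    {"date_id": 2, "year": 2019},
    {"date_id": 3, "year": 2019},
]


def pruned_scan(partitions, keep_keys):
    """Read only the partitions whose key is in keep_keys."""
    scanned, rows = [], []
    for key, part_rows in partitions.items():
        if key in keep_keys:
            scanned.append(key)
            rows.extend(part_rows)
    return scanned, rows


def join_with_dpp(partitions, dim_rows, predicate):
    # 1. Evaluate the filtered build side first (the "broadcast").
    dim_by_key = {r["date_id"]: r for r in dim_rows if predicate(r)}
    # 2. Reuse the build-side keys as a partition filter on the probe side.
    scanned, fact_rows = pruned_scan(partitions, set(dim_by_key))
    # 3. Hash join the surviving probe rows against the same build result.
    joined = [{**f, **dim_by_key[f["date_id"]]} for f in fact_rows]
    return scanned, joined


scanned, joined = join_with_dpp(
    fact_partitions, dim_dates, lambda r: r["year"] == 2019
)
print(scanned)       # partitions actually read
print(len(joined))   # number of joined rows
```

In this toy run, partition 1 is never read because no surviving build-side row has `date_id` 1. The build-side result is computed once and serves both as the pruning filter and as the join's hash table, which mirrors the reuse described for the broadcast case.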
>>> Regarding the place in the optimizer rules, it's preferred to happen
>>> late in the optimization, and definitely after join reorder.
>>>
>>>
>>> Thanks,
>>> Maryann
>>>
>>> On Wed, Oct 2, 2019 at 12:20 PM Reynold Xin <rxin@databricks.com> wrote:
>>>
>>>> Whoever created the JIRA years ago didn't describe DPP correctly, but
>>>> the linked JIRA in Hive was correct
>>>> (https://issues.apache.org/jira/browse/HIVE-9152), although it is
>>>> unfortunately much more terse than any of the patches we have in Spark.
>>>> Henry R's description was also correct.
>>>>
>>>> On Wed, Oct 02, 2019 at 9:18 AM, Ryan Blue <rblue@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> Where can I find a design doc for dynamic partition pruning that
>>>>> explains how it works?
>>>>>
>>>>> The JIRA issue, SPARK-11150, doesn't seem to describe dynamic partition
>>>>> pruning (as pointed out by Henry R.) and doesn't have any comments about
>>>>> the implementation's approach. And the PR description also doesn't have
>>>>> much information. It lists 3 cases for how a filter is built, but nothing
>>>>> about the overall approach or design that helps when trying to find out
>>>>> where it should be placed in the optimizer rules. It also isn't clear why
>>>>> this couldn't be part of spark-catalyst.
>>>>>
>>>>> On Wed, Oct 2, 2019 at 1:48 AM Wenchen Fan <cloud0fan@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The dynamic partition pruning rule generates "hidden" filters that
>>>>>> will be converted to real predicates at runtime, so it doesn't matter
>>>>>> where we run the rule.
>>>>>>
>>>>>> For PruneFileSourcePartitions, I'm not quite sure. Seems to me it's
>>>>>> better to run it before join reorder.
>>>>>>
>>>>>> On Sun, Sep 29, 2019 at 5:51 AM Ryan Blue <rblue@netflix.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I have been working on a PR that moves filter and projection pushdown
>>>>>>> into the optimizer for DSv2, instead of when converting to physical
>>>>>>> plan. This will make DSv2 work with optimizer rules that depend on
>>>>>>> stats, like join reordering.
>>>>>>>
>>>>>>> While adding the optimizer rule, I found that some rules appear to be
>>>>>>> out of order. For example, PruneFileSourcePartitions that handles
>>>>>>> filter pushdown for v1 scans is in SparkOptimizer (spark-sql) in a
>>>>>>> batch that will run after all of the batches in Optimizer
>>>>>>> (spark-catalyst) including CostBasedJoinReorder.
>>>>>>>
>>>>>>> SparkOptimizer also adds the new “dynamic partition pruning” rules
>>>>>>> *after* both the cost-based join reordering and the v1 partition
>>>>>>> pruning rule. I’m not sure why this should run after join reordering
>>>>>>> and partition pruning, since it seems to me like additional filters
>>>>>>> would be good to have before those rules run.
>>>>>>>
>>>>>>> It looks like this might just be that the rules were written in the
>>>>>>> spark-sql module instead of in catalyst. That makes some sense for the
>>>>>>> v1 pushdown, which is altering physical plan details (FileIndex) that
>>>>>>> have leaked into the logical plan. I’m not sure why the dynamic
>>>>>>> partition pruning rules aren’t in catalyst or why they run after the
>>>>>>> v1 predicate pushdown.
>>>>>>>
>>>>>>> Can someone more familiar with these rules clarify why they appear
>>>>>>> to be out of order?
>>>>>>>
>>>>>>> Assuming that this is an accident, I think it’s something that should
>>>>>>> be fixed before 3.0. My PR fixes early pushdown, but the “dynamic”
>>>>>>> pruning may still need to be addressed.
>>>>>>>
>>>>>>> rb
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
