hive-issues mailing list archives

From "Jesus Camacho Rodriguez (JIRA)" <>
Subject [jira] [Commented] (HIVE-15474) Extend limit propagation for chain of RS-GB-RS operators
Date Wed, 28 Dec 2016 13:36:58 GMT


Jesus Camacho Rodriguez commented on HIVE-15474:

[~lirui], sorry for the delay in replying; I was away for a few days.

I have updated the description of this issue to better explain what I am trying to
do. I think the initial description was too brief and vague.

Given the physical plan you provide, we will propagate the limit from RS4 into RS2 (observe
the explain plan above). RS2 produces the _top N_ keys for each partition; thus, the GBY3 operator
produces results only for those keys. Note that the patch makes no change to the
GBY operator.
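
To illustrate why this is safe, here is a minimal Python sketch (not Hive code) that simulates
the two plans on toy data. All function names and the partitioning scheme are assumptions made
for illustration only. It assumes that rows are partitioned on the grouping key and that each
partition forwards only the rows belonging to its N smallest grouping keys.

```python
import zlib
from collections import Counter


def group_count(rows):
    """Stand-in for the GBY operator: count rows per (key, value) group."""
    return Counter(rows)


def order_and_limit(groups, n):
    """Stand-in for the final RS plus LIMIT: order by key, value, agg and keep n rows."""
    return sorted((k, v, c) for (k, v), c in groups.items())[:n]


def rs_with_top_n(rows, n, num_partitions):
    """Stand-in for an RS with a pushed-down limit (hypothetical model).

    Rows are hash-partitioned on the grouping key, so a group never spans
    partitions. Each partition forwards only rows whose grouping key is
    among its n smallest distinct keys.
    """
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # Deterministic hash so the sketch behaves the same on every run.
        bucket = zlib.crc32(repr(row).encode()) % num_partitions
        partitions[bucket].append(row)
    kept = []
    for part in partitions:
        top_keys = set(sorted(set(part))[:n])
        kept.extend(r for r in part if r in top_keys)
    return kept


def plan_without_pushdown(rows, n):
    return order_and_limit(group_count(rows), n)


def plan_with_pushdown(rows, n, num_partitions=3):
    # The GBY step is identical in both plans; only its input shrinks.
    return order_and_limit(group_count(rs_with_top_n(rows, n, num_partitions)), n)


# Toy data: 21 distinct (key, value) groups spread over 100 rows.
rows = [(i % 7, "v%d" % (i % 3)) for i in range(100)]
same = plan_with_pushdown(rows, 5) == plan_without_pushdown(rows, 5)
print("plans agree:", same)
```

Any group in the global top N is also within the top N of its own partition, so all of its rows
survive and its aggregate is complete. The extra groups kept by other partitions sort after the
top N and are dropped by the final limit.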

Concerning Spark: my current understanding is that the chain of operators is the same. Having
thought further about it, this optimization should not pose any problem in that
context, since the GBY logic has not changed. If Spark chooses to ignore RS2 since it is not sorting
the input for GBY3, that should be fine: the limit is in the RS2 operator, not in GBY3. Spark
will not benefit from the optimization, but the result remains correct.

{quote}
Actually I'm also wondering: if we use parallel order by (to use a range partitioner rather
than a hash partitioner in RS2), we can do the groupBy and orderBy in a single stage, which
may improve performance in some cases.
{quote}
It might indeed be beneficial in some cases. However, it is a complex cost-based decision
that would need multiple extensions, as I can think of multiple factors that would influence
it, e.g., data skew, number of records for the top N groups, the limit of records itself,

> Extend limit propagation for chain of RS-GB-RS operators
> --------------------------------------------------------
>                 Key: HIVE-15474
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: Physical Optimizer
>    Affects Versions: 2.2.0
>            Reporter: Jesus Camacho Rodriguez
>            Assignee: Jesus Camacho Rodriguez
>         Attachments: HIVE-15474.patch
> The goal is to extend the work started in HIVE-14002.
> For instance, given the following query:
> {code:sql}
> explain
> select key, value, count(key + 1) as agg1 from src 
> group by key, value
> order by key, value, agg1 limit 20;
> {code}
> We generate the following physical plan:
> {{TS1 - GBY2 - RS3 - GBY4 - RS5 - SEL6 - LIM7 - FS8}}
> We can push the limit to RS3 operator, as we will generate records for the _top N_ keys,
> and thus, GBY4 will produce the _top N_ results. However, currently we do not do it.
