hive-issues mailing list archives

From "Jesus Camacho Rodriguez (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-23365) Put RS deduplication optimization under cost based decision
Date Mon, 01 Jun 2020 16:36:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-23365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17121151#comment-17121151 ]

Jesus Camacho Rodriguez commented on HIVE-23365:
------------------------------------------------

Left a minor comment in the review. Other than that, patch LGTM. +1 (pending tests)

> Put RS deduplication optimization under cost based decision
> -----------------------------------------------------------
>
>                 Key: HIVE-23365
>                 URL: https://issues.apache.org/jira/browse/HIVE-23365
>             Project: Hive
>          Issue Type: Improvement
>          Components: Physical Optimizer
>            Reporter: Jesus Camacho Rodriguez
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-23365.01.patch, HIVE-23365.02.patch, HIVE-23365.03.patch, HIVE-23365.04.patch, HIVE-23365.05.patch
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, RS deduplication is always executed whenever it is semantically correct. However,
> it could be beneficial to leave both RS operators in the plan, e.g., if the NDV of the second
> RS is very low. Thus, we would like this decision to be cost-based. We could use a simple
> heuristic that would work fine for most of the cases without introducing regressions for existing
> cases, e.g., if NDV for partition column is less than estimated parallelism in the second
> RS, do not execute deduplication.
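
For illustration only, the heuristic in the description could be sketched roughly as below. This is not the code from the attached patches: the class name, method name, and parameters are hypothetical, and treating a non-positive NDV as "statistics unavailable" (falling back to the current always-deduplicate behavior) is an assumption made here so that the sketch does not change existing plans.

```java
/**
 * Hypothetical sketch of the cost-based check described in the issue.
 * Names and the handling of missing statistics are assumptions, not Hive APIs.
 */
public final class RsDedupHeuristic {

    private RsDedupHeuristic() {
    }

    /**
     * Decides whether two RS operators should be merged.
     *
     * @param partitionColumnNdv   estimated number of distinct values of the
     *                             partition column; a value <= 0 is assumed
     *                             to mean "statistics unavailable"
     * @param estimatedParallelism estimated parallelism of the second RS
     * @return true if deduplication should be executed
     */
    public static boolean shouldDeduplicate(long partitionColumnNdv, int estimatedParallelism) {
        // Assumption: without statistics, keep the existing behavior
        // (always deduplicate) to avoid regressions in current plans.
        if (partitionColumnNdv <= 0) {
            return true;
        }
        // Heuristic from the description: if the NDV is less than the
        // estimated parallelism, do not execute deduplication.
        return partitionColumnNdv >= estimatedParallelism;
    }
}
```

With these assumptions, `shouldDeduplicate(3, 10)` keeps both RS operators in the plan, while `shouldDeduplicate(1000, 10)` merges them as before.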



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
