hive-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-21217) Optimize range calculation for PTF
Date Fri, 15 Feb 2019 14:41:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-21217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HIVE-21217:
----------------------------------
    Labels: pull-request-available  (was: )

> Optimize range calculation for PTF
> ----------------------------------
>
>                 Key: HIVE-21217
>                 URL: https://issues.apache.org/jira/browse/HIVE-21217
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-21217.0.patch, HIVE-21217.1.patch, HIVE-21217.2.patch
>
>
> During window function execution, Hive has to iterate over the rows neighbouring the current
> row to find the beginning and end of the proper range (on which the aggregation will be executed).
> When we're using range-based windows and many rows share the same key value, this
> can take a lot of time. For example, take a partition of 80M rows that contains 2 ranges of
> 40M rows each according to the ORDER BY column: within each of these 40M rowsets we're doing
> roughly 40M x 40M/2 steps, which is O(n^2) time complexity.
> I propose to introduce a cache that keeps track of already calculated range ends so they
> can be reused in future scans.
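
The caching idea described above can be illustrated with a minimal sketch. This is not the code from the attached patches: the class name RangeEndCacheSketch, the methods naiveEnds and cachedEnds, and the steps counter are all hypothetical, and the sketch assumes the order-by keys are plain long values sorted ascending with a non-negative FOLLOWING offset. It only shows why remembering the last computed range end avoids rescanning a long run of equal keys.

```java
import java.util.Arrays;

/** Illustrative sketch only; names and structure are assumptions, not Hive's PTF code. */
final class RangeEndCacheSketch {
    /** Counts forward-scan steps so the two strategies can be compared. */
    static long steps = 0;

    /**
     * Naive approach: for every row, scan forward until a key exceeds
     * sortedKeys[row] + following. Returns the exclusive range end per row.
     */
    static int[] naiveEnds(long[] sortedKeys, long following) {
        int n = sortedKeys.length;
        int[] ends = new int[n];
        for (int row = 0; row < n; row++) {
            ends[row] = scanForEnd(sortedKeys, row, sortedKeys[row] + following);
        }
        return ends;
    }

    /**
     * Cached approach: rows with the same key have the same range end,
     * so the last computed end is reused instead of scanning again.
     */
    static int[] cachedEnds(long[] sortedKeys, long following) {
        int n = sortedKeys.length;
        int[] ends = new int[n];
        boolean hasCached = false;
        long cachedKey = 0;
        int cachedEnd = 0;
        for (int row = 0; row < n; row++) {
            long key = sortedKeys[row];
            if (!hasCached || key != cachedKey) {
                // Cache miss: compute the end once and remember it for this key.
                cachedEnd = scanForEnd(sortedKeys, row, key + following);
                cachedKey = key;
                hasCached = true;
            }
            ends[row] = cachedEnd;
        }
        return ends;
    }

    /** Linear scan from row + 1 to the first index whose key is greater than limit. */
    private static int scanForEnd(long[] sortedKeys, int row, long limit) {
        int end = row + 1;
        while (end < sortedKeys.length && sortedKeys[end] <= limit) {
            end++;
            steps++;
        }
        return end;
    }
}
```

In this sketch, a run of k equal keys costs about k^2/2 scan steps with naiveEnds but only about k steps with cachedEnds, since every row after the first in the run hits the cache. A real implementation would also have to handle range starts, PRECEDING offsets, descending order and NULL keys, none of which are covered here.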



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
