hive-issues mailing list archives

From "Gopal V (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow
Date Mon, 10 Aug 2015 22:19:45 GMT

    [ https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680870#comment-14680870 ]

Gopal V commented on HIVE-11502:
--------------------------------

bq. So next step is, how do we fix the issue now?

Easiest would be to use vectorization, which doesn't need any Writables in the inner loop.

The vectorized hash code for doubles would automatically be very similar to your implementation, since it comes from Arrays.hashCode(double[]):

{code}
....
        for (double element : a) {
            long bits = Double.doubleToLongBits(element);
            result = 31 * result + (int)(bits ^ (bits >>> 32));
        }
        return result;
{code}
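
As a rough illustration of why the fold in that loop matters for double keys, here is a minimal, self-contained sketch. The class name DoubleKeyHashSketch and the truncatedHash method are hypothetical and not taken from Hive; truncatedHash is only an assumed contrast case that keeps the low 32 bits of the double and drops the rest.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch class; not part of Hive or Hadoop.
final class DoubleKeyHashSketch {

    // Same per-element step as the loop quoted above, with the
    // initial value of 1 that Arrays.hashCode(double[]) uses.
    static int foldedHash(double[] key) {
        int result = 1;
        for (double element : key) {
            long bits = Double.doubleToLongBits(element);
            // XOR the high 32 bits into the low 32 bits before narrowing.
            result = 31 * result + (int) (bits ^ (bits >>> 32));
        }
        return result;
    }

    // Assumed contrast case: narrow to int without folding,
    // so the high 32 bits (sign, exponent, top of mantissa) are lost.
    static int truncatedHash(double value) {
        return (int) Double.doubleToLongBits(value);
    }

    // Number of distinct folded hashes for the keys 1.0, 2.0, ..., n.
    static int distinctFolded(int n) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 1; i <= n; i++) {
            seen.add(foldedHash(new double[] {i}));
        }
        return seen.size();
    }

    // Number of distinct truncated hashes for the keys 1.0, 2.0, ..., n.
    static int distinctTruncated(int n) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 1; i <= n; i++) {
            seen.add(truncatedHash(i));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int n = 400_000; // key count mentioned in the issue description
        double[] sample = {1.0, 2.5, -3.75};
        System.out.println("matches Arrays.hashCode: "
                + (foldedHash(sample) == Arrays.hashCode(sample)));
        System.out.println("distinct folded hashes:    " + distinctFolded(n));
        System.out.println("distinct truncated hashes: " + distinctTruncated(n));
    }
}
```

Whole-number doubles of this size have all-zero low 32 bits, so the truncating variant maps every key to the same hash value, while the folded variant keeps them distinct. Whether this matches the exact cause in the non-vectorized code path is an assumption here, not something stated in the comment above.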

> Map side aggregation is extremely slow
> --------------------------------------
>
>                 Key: HIVE-11502
>                 URL: https://issues.apache.org/jira/browse/HIVE-11502
>             Project: Hive
>          Issue Type: Bug
>          Components: Logical Optimizer, Physical Optimizer
>    Affects Versions: 1.2.0
>            Reporter: Yongzhi Chen
>            Assignee: Yongzhi Chen
>
> For the following query:
> {noformat}
> create table tbl2 as 
> select col1, max(col2) as col2 
> from tbl1 group by col1;
> {noformat}
> If the group-by column has many distinct values (for example, 400000) and is of type double, map-side aggregation is very slow. I ran the query for more than 3 hours and then had to kill it.
> The same query finishes in 7 seconds if I turn off map-side aggregation with:
> {noformat}
> set hive.map.aggr = false;
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
