hive-issues mailing list archives

From "Prasanth Jayachandran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-12491) Column Statistics: 3 attribute join on a 2-source table is off
Date Tue, 01 Dec 2015 05:03:11 GMT

    [ https://issues.apache.org/jira/browse/HIVE-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033094#comment-15033094 ]

Prasanth Jayachandran commented on HIVE-12491:
----------------------------------------------

The NDV of a UDF expression is currently assumed to be the worst case, i.e. the number of
rows. Ideally, for built-in UDFs, we should assume that worst case only when the UDFType is
non-deterministic (e.g. UDFRand), since the output NDV cannot be estimated. If the UDF is
deterministic, like UDFMonth, we should instead use the NDV of the column referenced in the UDF.
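
A minimal sketch of that rule, assuming a single referenced column. The class and method
names (UdfNdvSketch, estimateUdfNdv) are hypothetical and are not taken from the attached
patch; the boolean stands in for whatever the UDFType annotation reports, and capping the
result at the row count is an added assumption.

```java
final class UdfNdvSketch {

    /**
     * Estimates the NDV of a UDF's output (hypothetical helper, not Hive code).
     *
     * @param deterministic whether the UDF is deterministic (stand-in for UDFType)
     * @param columnNdv     NDV of the single column the UDF references
     * @param numRows       row count of the operator's input
     */
    static long estimateUdfNdv(boolean deterministic, long columnNdv, long numRows) {
        if (!deterministic || columnNdv <= 0) {
            // Non-deterministic UDF (e.g. UDFRand) or unknown column NDV:
            // fall back to the worst case, one distinct value per row.
            return numRows;
        }
        // A deterministic UDF cannot emit more distinct outputs than it has
        // distinct inputs. Assumption: also cap at the row count.
        return Math.min(columnNdv, numRows);
    }

    public static void main(String[] args) {
        long numRows = 1_000_000L; // illustrative values only
        long dtNdv = 3_650L;
        System.out.println("deterministic:     " + estimateUdfNdv(true, dtNdv, numRows));
        System.out.println("non-deterministic: " + estimateUdfNdv(false, dtNdv, numRows));
    }
}
```

This only bounds the estimate from above: a UDF such as UDFMonth has at most 12 distinct
outputs, which the column NDV alone does not capture.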

> Column Statistics: 3 attribute join on a 2-source table is off
> --------------------------------------------------------------
>
>                 Key: HIVE-12491
>                 URL: https://issues.apache.org/jira/browse/HIVE-12491
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.3.0, 2.0.0
>            Reporter: Gopal V
>            Assignee: Prasanth Jayachandran
>         Attachments: HIVE-12491.WIP.patch
>
>
> The eased out denominator has to detect duplicate row-stats from different attributes.
> {code}
> select account_id from customers c,  customer_activation ca
>   where c.customer_id = ca.customer_id
>   and year(ca.dt) = year(c.dt) and month(ca.dt) = month(c.dt)
>   and year(ca.dt) between year('2013-12-26') and year('2013-12-26')
> {code}
> {code}
>   private Long getEasedOutDenominator(List<Long> distinctVals) {
>       // Exponential back-off for NDVs.
>       // 1) Descending order sort of NDVs
>       // 2) denominator = NDV1 * (NDV2 ^ (1/2)) * (NDV3 ^ (1/4)) * ....
>       Collections.sort(distinctVals, Collections.reverseOrder());
>       long denom = distinctVals.get(0);
>       for (int i = 1; i < distinctVals.size(); i++) {
>         denom = (long) (denom * Math.pow(distinctVals.get(i), 1.0 / (1 << i)));
>       }
>       return denom;
>     }
> {code}
> This is called with {{[8007986, 821974390, 821974390]}}, i.e. NDVs for 3 join keys, 2 of
> which are derived from the same column.
> {code}
>         Reduce Output Operator (RS_12)
>           key expressions: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
>           sort order: +++
>           Map-reduce partition columns: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
>           value expressions: _col1 (type: bigint)
>           Join Operator (JOIN_13)
>             condition map:
>                  Inner Join 0 to 1
>             keys:
>               0 _col0 (type: bigint), year(_col1) (type: int), month(_col1) (type: int)
>               1 _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
>             outputColumnNames: _col3
> {code}
> So the eased out denominator is off by a factor of 30,000 or so, causing OOMs in map-joins.
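
As a rough check of the quoted figure, the sketch below prints the factor each NDV
contributes to the eased-out denominator for the list reported above, after the descending
sort. The class and method names are hypothetical. Reading "a factor of 30,000 or so" as
the square-root term of the repeated 821974390 entry is an assumption, though the
arithmetic is consistent with it.

```java
final class EasedOutFactors {

    /**
     * Factor contributed by the NDV at position i (0-based, after the
     * descending sort) in NDV1 * NDV2^(1/2) * NDV3^(1/4) * ...
     */
    static double termFactor(long ndv, int i) {
        return Math.pow(ndv, 1.0 / (1L << i));
    }

    public static void main(String[] args) {
        // NDV list from the issue, already sorted in descending order.
        long[] sorted = {821974390L, 821974390L, 8007986L};
        for (int i = 0; i < sorted.length; i++) {
            System.out.printf("term %d: %d^(1/%d) = %.1f%n",
                    i, sorted[i], 1 << i, termFactor(sorted[i], i));
        }
    }
}
```

The second term, 821974390^(1/2), is a little over 28,670, so the repeated entry alone
multiplies the denominator by roughly that amount.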



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
