hive-issues mailing list archives

From "Hive QA (JIRA)" <>
Subject [jira] [Commented] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table
Date Thu, 04 Oct 2018 13:10:00 GMT


Hive QA commented on HIVE-20660:

Here are the results of testing the latest attachment:

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 21 failed/errored test(s), 15010 tests executed
*Failed tests:*
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[stat_estimate_related_col] (batchId=43)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udaf_histogram_numeric] (batchId=14)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[acid_no_buckets] (batchId=176)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[dpp] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_partitioned] (batchId=163)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[ptf] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[reopt_dpp] (batchId=177)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[tez_input_counters] (batchId=179)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vectorization_part] (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vectorized_ptf] (batchId=175)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[vector_join_part_col_char] (batchId=108)

Test results:
Console output:
Test logs:

Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 21 tests failed

This message is automatically generated.

ATTACHMENT ID: 12942300 - PreCommit-HIVE-Build

> Group by statistics estimation could be improved by bounding the total number of rows to source table
> -----------------------------------------------------------------------------------------------------
>                 Key: HIVE-20660
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: Statistics
>    Affects Versions: 4.0.0
>            Reporter: Vineet Garg
>            Assignee: Vineet Garg
>            Priority: Major
>         Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, HIVE-20660.3.patch
> Currently the stats for group by are estimated by taking the product of the NDVs of all the keys and bounding it by the number of rows of its immediate input. This bound could be improved by using the source table instead of the immediate input; the insight is that the cardinality/NDVs of a table cannot grow beyond the original (outer joins will only add NULLs, thereby increasing the cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same source table/input.
> This will improve the estimation in situations where group by is executed after joins, wherein Hive could end up overestimating the number of rows.
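The bounding rule described above can be sketched as follows. This is a minimal illustration of the idea, not Hive's actual estimator; the function names and the exact formula are simplifications:

```python
# Sketch of the group-by cardinality estimate discussed in this issue.
# NOT Hive's implementation -- a simplified model of the bounding logic.

def current_estimate(key_ndvs, input_rows):
    """Product of the group-by key NDVs, bounded only by the rows of
    the immediate input (e.g. the join output)."""
    product = 1
    for ndv in key_ndvs:
        product *= ndv
    return min(product, input_rows)

def improved_estimate(key_ndvs, input_rows, source_table_rows):
    """Same product, additionally bounded by the source table's row
    count: joining cannot create new distinct key combinations, so the
    group-by output cannot exceed the source table's cardinality
    (assuming all keys come from the same source table)."""
    return min(current_estimate(key_ndvs, input_rows), source_table_rows)

# Numbers from the reproducer below: t1 has 10,000 rows with
# NDV(i)=2500 and NDV(j)=500, and the join with t2 is estimated to
# produce roughly 99,700 rows.
ndvs = [2500, 500]
join_rows = 99_700
t1_rows = 10_000

print(current_estimate(ndvs, join_rows))            # bounded by the join output
print(improved_estimate(ndvs, join_rows, t1_rows))  # bounded by t1's row count
```

With these inputs the NDV product (1,250,000) far exceeds both bounds, so the current estimate clamps to the join output while the improved one clamps to t1's 10,000 rows, a much tighter figure.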
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='10000', 'rawDataSize'='18000');
> alter table t1 update statistics for column i set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='100000000', 'rawDataSize'='10000');
> alter table t2 update statistics for column i2 set('numDVs'='10000000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
> 	where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
>             Reduce Operator Tree:
>               Merge Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                 keys:
>                   0 _col1 (type: int)
>                   1 _col0 (type: int)
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 99700 Data size: 797288 Basic stats: COMPLETE Column stats: COMPLETE
>                 Group By Operator
>                   aggregations: count()
>                   keys: _col0 (type: int), _col1 (type: int)
>                   mode: hash
>                   outputColumnNames: _col0, _col1, _col2
>                   Statistics: Num rows: 49850 Data size: 797448 Basic stats: COMPLETE Column stats: COMPLETE <==========
>                   Reduce Output Operator
>                     key expressions: _col0 (type: int), _col1 (type: int)
>                     sort order: ++
>                     Map-reduce partition columns: _col0 (type: int), _col1 (type: int)
>                     Statistics: Num rows: 49850 Data size: 797448 Basic stats: COMPLETE Column stats: COMPLETE
>                     value expressions: _col2 (type: bigint)
>   .....................
> {code}

This message was sent by Atlassian JIRA
