hive-issues mailing list archives

From "Daniel Dai (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-17421) Clear incorrect stats after replication
Date Wed, 06 Sep 2017 07:09:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-17421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated HIVE-17421:
------------------------------
    Attachment: HIVE-17421.2.patch

Attached a patch with a better test.

We can divide the issue into 4 scenarios:
1. Bootstrap, non-partitioned table: it is replicated with create_table + move task. create_table
generates empty stats and the move task clears them, so we don't have a table stats summary on
the destination.
2. Bootstrap, partitioned table: it is replicated with add_partition + move task. add_partition
generates empty stats and the move task clears them, so we don't have a partition stats summary
on the destination.
3. Incremental, non-partitioned table: there is an alter_table task that is supposed to update the
stats. However, due to HIVE-17428 (the table does not exist at the time we parse the alter_table
event), Hive treats it as a create_table task and generates empty stats.
4. Incremental, partitioned table: there is an alter_partition task that is supposed to update the
stats. However, due to HIVE-17428 (the table does not exist at the time we parse the alter_table
event), Hive treats it as a create_table task and generates empty stats.

Bootstrap replication does not have the issue, since there are no stats on the destination side.
The issue in incremental replication is likely to be fixed by HIVE-17428. However, in case we
cannot fix HIVE-17428 in time (or similar issues come up), I attached a patch that nullifies the
empty stats, so we won't get wrong results from assuming the stats are correct.
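
As a rough illustration of what "nullify the empty stats" means (this is not the code in the
attached patch), the idea can be sketched in Python with table parameters modeled as a plain
dict. The function name clear_stats_accuracy and the sample values are hypothetical; only the
COLUMN_STATS_ACCURATE and numRows names come from this issue.

```python
# Sketch only: table/partition parameters modeled as a plain dict.
# This is NOT the HIVE-17421 patch; the helper name and values are hypothetical.

STATS_ACCURATE_KEY = "COLUMN_STATS_ACCURATE"  # property named in the issue


def clear_stats_accuracy(params):
    """Return a copy of params without the stats-accuracy marker.

    The stale values (e.g. numRows=0) are left in place; without the
    marker they should no longer be treated as trustworthy.
    """
    cleaned = dict(params)
    cleaned.pop(STATS_ACCURATE_KEY, None)  # no error if already absent
    return cleaned


# Parameters as they might look on the destination after an incremental load.
# The marker value "true" is illustrative only.
replicated = {"numRows": "0", STATS_ACCURATE_KEY: "true"}
cleaned = clear_stats_accuracy(replicated)
```

Returning a copy instead of mutating the input keeps the sketch easy to test; a real
implementation would work on the metastore objects directly.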

> Clear incorrect stats after replication
> ---------------------------------------
>
>                 Key: HIVE-17421
>                 URL: https://issues.apache.org/jira/browse/HIVE-17421
>             Project: Hive
>          Issue Type: Bug
>          Components: repl
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>         Attachments: HIVE-17421.1.patch, HIVE-17421.2.patch
>
>
> After replication, some stats summaries are incorrect. If hive.compute.query.using.stats
> is set to true, we will get wrong results on the destination side.
> This will not happen with bootstrap replication, because the stats summaries are in
> table properties and will be replicated to the destination. However, in incremental replication,
> this won't work. When a table is created, the stats summaries are empty (e.g., numRows=0). Later,
> when we insert data, the stats summaries are updated with
> update_table_column_statistics/update_partition_column_statistics; however, neither event is
> captured in incremental replication. Thus, on the destination side, we will get count(*)=0.
> The simple solution is to remove the COLUMN_STATS_ACCURATE property after incremental replication.
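
To make the failure mode in the description concrete, here is a toy Python model of a count
answered from metadata. It is a sketch under stated assumptions, not Hive's optimizer:
count_rows, the dict of parameters, and the marker value are all hypothetical, and the only
rule modeled is "trust numRows when the accuracy marker is present".

```python
def count_rows(params, rows, use_stats=True):
    """Toy count(*): answer from numRows if stats are marked accurate, else scan."""
    if use_stats and "COLUMN_STATS_ACCURATE" in params and "numRows" in params:
        return int(params["numRows"])  # answered from metadata, no scan
    return len(rows)  # fall back to counting the actual rows


# Three rows were replicated, but the stats summary still says numRows=0.
rows = ["r1", "r2", "r3"]
stale = {"numRows": "0", "COLUMN_STATS_ACCURATE": "true"}  # illustrative values

# With the marker present, the stale summary is trusted.
wrong = count_rows(stale, rows)

# With the marker removed, the count falls back to the data.
no_marker = {k: v for k, v in stale.items() if k != "COLUMN_STATS_ACCURATE"}
right = count_rows(no_marker, rows)
```

Here the use_stats flag stands in for hive.compute.query.using.stats; with it off, the stale
summary is never consulted.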



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
