hive-issues mailing list archives

From "Rui Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency
Date Thu, 21 Jun 2018 13:19:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519357#comment-16519357 ]

Rui Li commented on HIVE-19671:
-------------------------------

[~xuefuz], I agree it's not trivial to solve this on the Hive side. Maybe we can at least print
a warning if the query uses nondeterministic partitioning?
Another potential solution is to retry all downstream tasks when any upstream task fails,
which needs support from the execution engine.
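
To illustrate why a retry matters here, below is a rough Python sketch of one plausible failure mode. It is a toy model, not Hive or execution-engine code: it assumes a single upstream (map) task feeding two downstream (reduce) tasks, and assumes the upstream task is rerun after reducer 0 has already read its output but before reducer 1 has. The names run_map_task, rows_seen_after_retry and summarize are made up for this example.

```python
import random
from collections import Counter


def run_map_task(rows, num_reducers, partition_fn):
    """Run one attempt of the upstream task: bucket each row by reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[partition_fn(row) % num_reducers].append(row)
    return buckets


def rows_seen_after_retry(rows, partition_fn, num_reducers=2):
    """Return every row the reducers consume when the upstream task is rerun.

    Assumed scenario: reducer 0 reads the first attempt's output, the
    output is then lost and the task is rerun, and reducer 1 reads the
    second attempt's output.
    """
    first_attempt = run_map_task(rows, num_reducers, partition_fn)
    reducer0_input = first_attempt[0]
    second_attempt = run_map_task(rows, num_reducers, partition_fn)
    reducer1_input = second_attempt[1]
    return reducer0_input + reducer1_input


def summarize(seen, rows):
    """List the rows that no reducer saw and the rows seen more than once."""
    counts = Counter(seen)
    lost = [r for r in rows if counts[r] == 0]
    duplicated = [r for r in rows if counts[r] > 1]
    return lost, duplicated


rows = list(range(1000))
rng = random.Random(42)

# Nondeterministic partitioning, analogous to "distribute by rand()".
random_seen = rows_seen_after_retry(rows, lambda row: rng.randrange(2))
# Deterministic partitioning on the row value, analogous to hashing a column.
hashed_seen = rows_seen_after_retry(rows, lambda row: hash(row))

for label, seen in (("rand", random_seen), ("hash", hashed_seen)):
    lost, duplicated = summarize(seen, rows)
    print(label, "rows seen:", len(seen),
          "lost:", len(lost), "duplicated:", len(duplicated))
```

In this model the deterministic partitioner sends each row to the same reducer on both attempts, so the rerun is harmless. With the random partitioner a row can land in reducer 0 on the first attempt and reducer 1 on the second (counted twice), or the other way around (never counted). Rerunning all downstream tasks after the upstream rerun would avoid this, since every reducer would then read the same attempt.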

> Distribute by rand() can lead to data inconsistency
> ---------------------------------------------------
>
>                 Key: HIVE-19671
>                 URL: https://issues.apache.org/jira/browse/HIVE-19671
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rui Li
>            Assignee: Rui Li
>            Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
