hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Naveen Gangam (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-19889) Wrong results due to PPD of non deterministic functions with CBO
Date Thu, 21 Jun 2018 20:18:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-19889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Naveen Gangam updated HIVE-19889:
---------------------------------
      Resolution: Fixed
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

Fix has been committed to master. Thank you for your contribution [~janulatha]

> Wrong results due to PPD of non deterministic functions with CBO
> ----------------------------------------------------------------
>
>                 Key: HIVE-19889
>                 URL: https://issues.apache.org/jira/browse/HIVE-19889
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Janaki Lahorani
>            Assignee: Janaki Lahorani
>            Priority: Major
>             Fix For: 4.0.0
>
>         Attachments: HIVE-19889.1.patch, HIVE-19889.2.patch
>
>
> The following query can give wrong results when CBO is on:
> {code}
> select * from (
> select part1,randum123
> from (SELECT *, cast(rand() as double) AS randum123 FROM testA where part1='CA' and part2
= 'ABC') a
> where randum123 <= 0.5) s where s.randum123 > 0.25 limit 20;
> The plan of the query is as follows:
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Map Operator Tree:
>           TableScan
>             alias: testa
>             Statistics: Num rows: 2 Data size: 4580 Basic stats: COMPLETE Column stats:
NONE
>             Filter Operator
>               predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: boolean)
>               Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats:
NONE
>               Select Operator
>                 expressions: 'CA' (type: string), rand() (type: double)
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column
stats: NONE
>                 Limit
>                   Number of rows: 20
>                   Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column
stats: NONE
>                   File Output Operator
>                     compressed: false
>                     Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column
stats: NONE
>                     table:
>                         input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                         output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: 20
>       Processor Tree:
>         ListSink
> {code}
> The relevant part in the plan is the filter:
> {code}
>             Filter Operator
>               predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: boolean)
> {code}
> The predicates randum123 <= 0.5 and s.randum123 > 0.25 were pushed down.  And randum123
was resolved to rand().  This is bad because it will result in invocation of rand() two times
and rand() UDF is non-deterministic.  Both the rand calls can generate values that can satisfy
the predicates independently, but not together, whereas the original intention of the query
is to give results when rand falls between 0.25 and 0.5.
> A sample result:
> {code}
> CA	0.9191984370369802
> CA	0.397933021566812
> {code}
> where the condition was not satisfied.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message