hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Zac Zhou (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-18270) count(distinct) using join and group by produce incorrect output when hive.auto.convert.join=false and hive.auto.convert.join.noconditionaltask=false
Date Tue, 19 Dec 2017 05:18:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-18270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Zac Zhou updated HIVE-18270:
----------------------------
    Attachment: HIVE-18270.3.patch

As hive 3.0 has refactored the code, and the ReducesinkDeduplicationUtils class was added
in HIVE-17037.  It looks like that the bug dose not exist in the master.  

> count(distinct) using join and group by produce incorrect output when hive.auto.convert.join=false
and hive.auto.convert.join.noconditionaltask=false
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-18270
>                 URL: https://issues.apache.org/jira/browse/HIVE-18270
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.2.1, 2.1.1, 2.2.0, 2.3.0
>            Reporter: Zac Zhou
>            Assignee: Zac Zhou
>         Attachments: HIVE-18270.1.patch, HIVE-18270.2.patch, HIVE-18270.3.patch
>
>
> When I run the following query:
> explain 
> SELECT foo.id, count(distinct foo.line_id) as factor from 
>  foo JOIN bar ON (foo.id = bar.id)
>  WHERE foo.orders != 'blah'  
>  group by foo.id; 
> The following error is got:
> java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
> 	at java.util.ArrayList.rangeCheck(ArrayList.java:635)
> 	at java.util.ArrayList.get(ArrayList.java:411)
> 	at org.apache.hadoop.hive.ql.optimizer.correlation.ReduceSinkDeDuplication$AbsctractReducerReducerProc.merge(ReduceSinkDeDuplication.java:216)
> 	at org.apache.hadoop.hive.ql.optimizer.correlation.ReduceSinkDeDuplication$JoinReducerProc.process(ReduceSinkDeDuplication.java:557)
> 	at org.apache.hadoop.hive.ql.optimizer.correlation.ReduceSinkDeDuplication$AbsctractReducerReducerProc.process(ReduceSinkDeDuplication.java:166)
> 	at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
> 	at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:95)
> 	at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:79)
> 	at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:133)
> 	at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:110)
> 	at org.apache.hadoop.hive.ql.optimizer.correlation.ReduceSinkDeDuplication.transform(ReduceSinkDeDuplication.java:108)
> 	at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:192)
> 	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10201)
> 	at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)
> 	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
> 	at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:74)
> 	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
> 	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
> 	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
> 	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
> 	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
> 	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
> 	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
> 	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:213)
> 	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:165)
> 	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
> 	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:736)
> 	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
> 	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> It looks like it is a bug of ReduceSinkDeDuplication optimizer. 
> Since the columns of count distinct need to be added into reduce key for sorting, the
reducesink of group can't be replaced with the ones of join. 
> In the case of count distinct query, reducesink of group should not be merged  
>   



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message