hive-issues mailing list archives

From "Rui Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver
Date Fri, 08 Dec 2017 10:21:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16283310#comment-16283310 ]

Rui Li commented on HIVE-18148:
-------------------------------

The following query reproduces the issue (assuming p and q are the partition columns of
part1 and part2, respectively):
{code}
explain select * from src join part1 on src.key=part1.p join part2 on src.value=part2.q;
{code}
It creates an OP tree like this:
{code}
/**
 *         TS(part1)    TS(src)
 *             |           |
 *            ...         FIL
 *             |          |  \
 *            RS         RS  SEL
 *              \        /    |
 * TS(part2)       JOIN1     GBY
 *     |         /     \      |
 *    RS        RS    SEL    DPP1
 *     \       /       |
 *       JOIN2        GBY
 *                     |
 *                    DPP2
 */
{code}
DPP1 is for part1 and DPP2 is for part2.

In SplitOpTreeForDPP, we process DPP2 first and clone the sub-tree above JOIN1, which means
DPP1 also gets cloned into the new sub-tree. The DPP1 in the cloned tree is then left
unprocessed, which eventually causes the NPE.
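To make the failure mode concrete, here is a minimal, hypothetical Java model of the split. These classes are illustrative only, not Hive's actual operator or MapWork API: the point is that cloning the sub-tree above JOIN1 copies DPP1 but does not re-link its target map work, so the clone's target ends up null and any later dereference in the resolver produces the NPE.

```java
// Simplified, hypothetical model of the SplitOpTreeForDPP problem.
// These classes are NOT Hive's real API, just an illustration of the bug.
import java.util.ArrayList;
import java.util.List;

class OpNode {
    final String name;
    final List<OpNode> children = new ArrayList<>();
    OpNode target;            // for a DPP sink: the map work it prunes

    OpNode(String name) { this.name = name; }

    // Deep-copies the subtree but, like the buggy split, does not
    // re-link DPP targets to anything in the cloned tree.
    OpNode cloneSubtree() {
        OpNode copy = new OpNode(name);
        copy.target = null;   // the clone loses its target -> later NPE
        for (OpNode c : children) copy.children.add(c.cloneSubtree());
        return copy;
    }
}

public class SplitOpTreeModel {
    // Builds the nested-DPP shape: JOIN1's subtree contains DPP1,
    // and DPP1 points at the scan of part1.
    static OpNode buildJoin1WithDpp1(OpNode part1Scan) {
        OpNode join1 = new OpNode("JOIN1");
        OpNode dpp1 = new OpNode("DPP1");
        dpp1.target = part1Scan;          // DPP1 prunes TS(part1)
        join1.children.add(dpp1);
        return join1;
    }

    public static void main(String[] args) {
        OpNode part1Scan = new OpNode("TS(part1)");
        OpNode join1 = buildJoin1WithDpp1(part1Scan);
        // Processing DPP2 first clones everything above JOIN1,
        // including DPP1 -- whose target is null in the clone.
        OpNode clonedDpp1 = join1.cloneSubtree().children.get(0);
        System.out.println("cloned DPP1 target: " + clonedDpp1.target);
        // Dereferencing that null target is the NPE the resolver hits.
    }
}
```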

I think the case here is a kind of "nested" DPP. If we strictly follow our current DPP
logic, we should first launch a job to evaluate DPP1, then use the result to prune part1
so that JOIN1 and DPP2 can be evaluated. Only then can we run the "real" query, using the
results of DPP1 and DPP2 to prune part1 and part2 respectively.

This would add complexity to the compiler and may not even be good for performance, especially
when the query involves more joins. Therefore I think we should avoid such cases.
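For illustration, the job ordering that a strict nested-DPP evaluation would require can be sketched as follows (the names are hypothetical, not Hive's scheduler API); each extra nesting level adds another upstream job before the real query can start, which is the performance concern above.

```java
// Hypothetical sketch of the job chain that strict nested-DPP
// evaluation would require for the example query. Names are
// illustrative only, not Hive's actual scheduler API.
import java.util.List;

public class NestedDppOrder {
    public static List<String> plannedJobs() {
        return List.of(
            "job1: evaluate DPP1 (from src)              -> prune TS(part1)",
            "job2: evaluate JOIN1 and DPP2 (pruned part1) -> prune TS(part2)",
            "job3: run the real query with part1 and part2 pruned");
    }

    public static void main(String[] args) {
        plannedJobs().forEach(System.out::println);
    }
}
```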
[~stakiar], [~xuefuz], [~kellyzly], [~csun] what do you think?

> NPE in SparkDynamicPartitionPruningResolver
> -------------------------------------------
>
>                 Key: HIVE-18148
>                 URL: https://issues.apache.org/jira/browse/HIVE-18148
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Rui Li
>            Assignee: Rui Li
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
>         at org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
>         at org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
>         at org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
>         at org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
>         at org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
>         at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. The root cause seems to be a malformed operator tree generated by SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
