hive-issues mailing list archives

From "liyunzhang (JIRA)" <>
Subject [jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark
Date Wed, 27 Dec 2017 08:38:00 GMT


liyunzhang commented on HIVE-18301:

My understanding is that this information will be lost if the HadoopRDD is cached.

Do you mean that HadoopRDD will not store the Spark plan? If so: actually, Hive stores
the Spark plan in a file on HDFS and serializes/deserializes it from that file. See the code under
hive/ql/src/java/org/apache/hadoop/hive/ql/exec/  If not, please
explain in more detail.
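The round trip described above can be sketched as follows. This is a minimal, simplified illustration: `MapWorkStub`, `roundTrip`, and the use of plain Java serialization to a local temp file are all assumptions made for this sketch; Hive itself serializes the real plan objects (I believe via Kryo in `Utilities.serializePlan`/`deserializePlan`) to a scratch directory on HDFS.

```java
import java.io.*;
import java.nio.file.*;

// Hypothetical stand-in for Hive's plan object (e.g. MapWork).
class MapWorkStub implements Serializable {
    final String alias;
    MapWorkStub(String alias) { this.alias = alias; }
}

public class PlanRoundTrip {
    // Serialize a plan to a file, then read it back as each task would.
    static String roundTrip() throws Exception {
        Path planFile = Files.createTempFile("map", ".plan");
        // Write the plan out (Hive writes it to a scratch dir on HDFS).
        try (ObjectOutputStream out =
                 new ObjectOutputStream(Files.newOutputStream(planFile))) {
            out.writeObject(new MapWorkStub("src_table"));
        }
        // Each task later deserializes the plan from the same file,
        // so the plan survives independently of any cached RDD.
        try (ObjectInputStream in =
                 new ObjectInputStream(Files.newInputStream(planFile))) {
            return ((MapWorkStub) in.readObject()).alias;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip());
    }
}
```

The point of the sketch is only that the plan lives in a file, not inside the RDD, so caching the RDD would not lose it.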

My question here is: is there any other reason to disable MapInput#cache besides avoiding
multi-insert cases where there is a union operator after {{from}}, such as:

{code}
from (select * from dec union all select * from dec2) s
insert overwrite table dec3 select, sum(s.value) group by
insert overwrite table dec4 select, s.value order by s.value;
{code}


If there is no other reason to disable MapInput#cache, I guess that for HIVE-17486 we can enable
MapInput cache, because HIVE-17486 merges the same single table. There are few cases like the above
({{from (select A union B) ...}}).
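The benefit being argued for can be sketched as follows. This is a hedged, Spark-free illustration, not Hive's implementation: the `Supplier` here stands in for the MapInput HadoopRDD, the two consumers stand in for the two INSERT branches of a multi-insert, and `cached(...)` plays the role of `rdd.persist()`. All names are hypothetical.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class MapInputCacheSketch {
    // Counts how many times the "table scan" actually runs.
    static final AtomicInteger scans = new AtomicInteger();

    // Uncached source: every consumer triggers a fresh scan,
    // like recomputing the MapInput RDD per INSERT branch.
    static Supplier<List<Integer>> uncached() {
        return () -> { scans.incrementAndGet(); return List.of(1, 2, 3); };
    }

    // Cached source: the scan result is materialized once and shared,
    // analogous to persisting the MapInput RDD.
    static Supplier<List<Integer>> cached(Supplier<List<Integer>> scan) {
        return new Supplier<>() {
            List<Integer> memo;
            public synchronized List<Integer> get() {
                if (memo == null) memo = scan.get();
                return memo;
            }
        };
    }

    // Two consumers of the same source, like the two INSERT branches.
    static int runMultiInsert(Supplier<List<Integer>> source) {
        scans.set(0);
        int sum = source.get().stream().mapToInt(Integer::intValue).sum();
        int max = source.get().stream().mapToInt(Integer::intValue).max().orElse(0);
        return scans.get(); // number of scans the two branches cost
    }

    public static void main(String[] args) {
        System.out.println(runMultiInsert(uncached()));         // two scans
        System.out.println(runMultiInsert(cached(uncached()))); // one scan
    }
}
```

Under this sketch the single-table case of HIVE-17486 is exactly the shape where caching halves the scan cost, which is the argument being made above.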

> Investigate to enable MapInput cache in Hive on Spark
> -----------------------------------------------------
>                 Key: HIVE-18301
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
> An IOContext problem was found in MapTran when Spark RDD cache was enabled, in HIVE-8920,
> so we disabled RDD cache in MapTran at [SparkPlanGenerator|]. The problem is that IOContext
> seems not to be initialized correctly in Spark yarn client/cluster mode, which causes an
> exception like:
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most recent failure:
Lost task 93.3 in stage 0.0 (TID 616, bdpe48): java.lang.RuntimeException: Error processing
row: java.lang.NullPointerException
> 	at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(
> 	at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(
> 	at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(
> 	at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(
> 	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
> 	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> 	at
> 	at org.apache.spark.executor.Executor$
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(
> 	at java.util.concurrent.ThreadPoolExecutor$
> 	at
> Caused by: java.lang.NullPointerException
> 	at org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(
> 	at org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(
> 	at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(
> 	at org.apache.hadoop.hive.ql.exec.MapOperator.process(
> 	at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(
> 	... 12 more
> Driver stacktrace:
> {code}
> In yarn client/cluster mode, [ExecMapperContext#currentInputPath|] is sometimes
> null when RDD cache is enabled.
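The null-path failure mode in the stack trace above can be sketched as follows. This is an assumption-laden illustration, not Hive's code: `IOContextStub` and `nominalPath` are hypothetical stand-ins for IOContext and `AbstractMapOperator.getNominalPath`, and the premise (that the record reader which populates the current input path never runs when rows are replayed from a cached RDD) is the hypothesis under investigation here.

```java
public class IoContextNpeSketch {
    // Stand-in for IOContext: per-thread state that the record reader
    // is expected to populate before any row is processed.
    static class IOContextStub {
        static final ThreadLocal<String> currentInputPath = new ThreadLocal<>();
    }

    // Stand-in for AbstractMapOperator.getNominalPath(...).
    static String nominalPath() {
        String p = IOContextStub.currentInputPath.get();
        if (p == null) {
            // Matches the NPE at getNominalPath in the trace above.
            throw new NullPointerException("no current input path set");
        }
        return p;
    }

    public static void main(String[] args) {
        // First computation: the reader sets the path, processing succeeds.
        IOContextStub.currentInputPath.set("hdfs://warehouse/dec/part-0");
        System.out.println(nominalPath());

        // Cached replay: no reader runs, so the path is missing.
        IOContextStub.currentInputPath.remove();
        try {
            nominalPath();
        } catch (NullPointerException e) {
            System.out.println("NPE: " + e.getMessage());
        }
    }
}
```

If this model is right, enabling the cache safely would require either repopulating the context on replay or removing the operators' dependence on the current input path.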

This message was sent by Atlassian JIRA
