Hi, Eirik,

Yes, please open a JIRA. 

Thanks,

Xiao

2018-03-23 8:03 GMT-07:00 Eirik Thorsnes <eirik.thorsnes@uni.no>:
Hi all,

I'm trying the new native ORC data source in Spark 2.3
(org.apache.spark.sql.execution.datasources.orc).

I compiled Spark 2.3 from the git branch-2.3 as of March 20th.
I get the same error with the Spark 2.2 from Hortonworks HDP 2.6.4.

*NOTE*: the error occurs only with zlib compression. With Snappy I get
an extra log line saying "OrcCodecPool: Got brand-new codec SNAPPY".
Perhaps the zlib codec is never loaded/triggered in the new code?

I can write using the new native code path without errors, but when
*reading* zlib-compressed ORC files, either the newly written ones *or*
older ones written with Spark 2.2/1.6, I get the following exception.

======= cut =========
2018-03-23 10:36:08,249 INFO FileScanRDD: Reading File path: hdfs://.../year=1999/part-r-00000-2573bfff-1f18-47d9-b0fb-37dc216b8a99.orc, range: 0-134217728, partition values: [1999]
2018-03-23 10:36:08,326 INFO ReaderImpl: Reading ORC rows from hdfs://.../year=1999/part-r-00000-2573bfff-1f18-47d9-b0fb-37dc216b8a99.orc with {include: [true, true, true, true, true, true, true, true, true], offset: 0, length: 134217728}
2018-03-23 10:36:08,326 INFO RecordReaderImpl: Reader schema not provided -- using file schema struct<datetime:timestamp,lon:float,lat:float,u10:smallint,v10:smallint,lcc:smallint,mcc:smallint,hcc:smallint>

2018-03-23 10:36:08,824 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.nio.BufferUnderflowException
        at java.nio.Buffer.nextGetIndex(Buffer.java:500)
        at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:249)
        at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:248)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:58)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
        at org.apache.orc.impl.TreeReaderFactory$TimestampTreeReader.nextVector(TreeReaderFactory.java:976)
        at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:1815)
        at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1184)
        at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.scala:186)
        at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.scala:114)
        at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
======= cut =========

I have the following set in spark-defaults.conf:

spark.sql.hive.convertMetastoreOrc true
spark.sql.orc.char.enabled true
spark.sql.orc.enabled true
spark.sql.orc.filterPushdown true
spark.sql.orc.impl native
spark.sql.orc.enableVectorizedReader true


If I set these to false and use the old Hive reader (or specify the full
class name of the old Hive reader in the spark-shell), I get correct
results with both new and old ORC files.
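
As a rough illustration of that fallback, here is a minimal Python sketch
that derives the "disabled" settings from the spark-defaults.conf listing
above. The helper names (hive_fallback_conf, apply_conf) are hypothetical,
and mapping "false" to spark.sql.orc.impl=hive is an assumption about what
was meant, since that key takes native or hive rather than a boolean.
Whether every key takes effect when set at runtime, rather than only from
spark-defaults.conf, is not confirmed by the message.

```python
# Sketch only: the keys are copied from the spark-defaults.conf listing
# in this message; the helper names are hypothetical.
NATIVE_ORC_CONF = {
    "spark.sql.hive.convertMetastoreOrc": "true",
    "spark.sql.orc.char.enabled": "true",
    "spark.sql.orc.enabled": "true",
    "spark.sql.orc.filterPushdown": "true",
    "spark.sql.orc.impl": "native",
    "spark.sql.orc.enableVectorizedReader": "true",
}


def hive_fallback_conf(conf=NATIVE_ORC_CONF):
    """Return a copy with boolean switches off and the ORC impl set to hive."""
    fallback = {}
    for key, value in conf.items():
        if key == "spark.sql.orc.impl":
            # Assumption: "set to false" means selecting the old Hive reader.
            fallback[key] = "hive"
        elif value == "true":
            fallback[key] = "false"
        else:
            fallback[key] = value
    return fallback


def apply_conf(spark, conf):
    """Apply each setting to an existing SparkSession via spark.conf.set."""
    for key, value in conf.items():
        spark.conf.set(key, value)
```

In a pyspark shell this would be used as
apply_conf(spark, hive_fallback_conf()) before reading the ORC files again.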

If I use Snappy compression it works with the new reader without error.

NOTE: I'm running on Hortonworks HDP 2.6.4 (Hadoop 2.7.3), and I get the
same error with the Spark 2.2 there, which I understand includes many of
the patches from the Spark 2.3 branch.

Should this be reported in the JIRA system?

Regards,
Eirik

--
Eirik Thorsnes

