spark-user mailing list archives

From: Xiao Li <gatorsm...@gmail.com>
Subject: Re: ORC native in Spark 2.3, with zlib, gives java.nio.BufferUnderflowException during read
Date: Tue, 27 Mar 2018 23:13:19 GMT
Hi, Eirik,

Yes, please open a JIRA.

Thanks,

Xiao

2018-03-23 8:03 GMT-07:00 Eirik Thorsnes <eirik.thorsnes@uni.no>:

> Hi all,
>
> I'm trying the new ORC native in Spark 2.3
> (org.apache.spark.sql.execution.datasources.orc).
>
> I've compiled Spark 2.3 from the git branch-2.3 as of March 20th.
> I also get the same error for the Spark 2.2 from Hortonworks HDP 2.6.4.
>
> *NOTE*: the error only occurs with zlib compression, and I see that with
> Snappy I get an extra log-line saying "OrcCodecPool: Got brand-new codec
> SNAPPY". Perhaps zlib codec is never loaded/triggered in the new code?
>
> I can write using the new native code path without errors, but when
> *reading* zlib-compressed ORC, either newly written ORC files *or* older
> ORC files written with Spark 2.2/1.6, I get the following exception.
>
> ======= cut =========
> 2018-03-23 10:36:08,249 INFO FileScanRDD: Reading File path: hdfs://.../year=1999/part-r-00000-2573bfff-1f18-47d9-b0fb-37dc216b8a99.orc, range: 0-134217728, partition values: [1999]
> 2018-03-23 10:36:08,326 INFO ReaderImpl: Reading ORC rows from hdfs://.../year=1999/part-r-00000-2573bfff-1f18-47d9-b0fb-37dc216b8a99.orc with {include: [true, true, true, true, true, true, true, true, true], offset: 0, length: 134217728}
> 2018-03-23 10:36:08,326 INFO RecordReaderImpl: Reader schema not provided -- using file schema struct<datetime:timestamp,lon:float,lat:float,u10:smallint,v10:smallint,lcc:smallint,mcc:smallint,hcc:smallint>
>
> 2018-03-23 10:36:08,824 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.nio.BufferUnderflowException
>         at java.nio.Buffer.nextGetIndex(Buffer.java:500)
>         at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:249)
>         at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:248)
>         at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:58)
>         at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
>         at org.apache.orc.impl.TreeReaderFactory$TimestampTreeReader.nextVector(TreeReaderFactory.java:976)
>         at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:1815)
>         at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1184)
>         at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.scala:186)
>         at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.scala:114)
>         at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>         at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
>         at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
>         at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
>         at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>         at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>         at org.apache.spark.scheduler.Task.run(Task.scala:108)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> ======= cut =========
>
> I have the following set in spark-defaults.conf:
>
> spark.sql.hive.convertMetastoreOrc true
> spark.sql.orc.char.enabled true
> spark.sql.orc.enabled true
> spark.sql.orc.filterPushdown true
> spark.sql.orc.impl native
> spark.sql.orc.enableVectorizedReader true
>
>
> If I set these to false and use the old Hive reader (or specify the full
> class name of the old Hive reader in the spark-shell), I get correct
> results with both new and old ORC files.
>
> If I use Snappy compression it works with the new reader without error.
>
> NOTE: I'm running on Hortonworks HDP 2.6.4 (Hadoop 2.7.3) and I also get
> the same error for the Spark 2.2 there which I understand has many of
> the patches from the Spark 2.3 branch.
>
> Should this be reported in the JIRA system?
>
> Regards,
> Eirik
>
> --
> Eirik Thorsnes
>
>
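
A minimal sketch of the fallback described in the quoted message, that is, switching a running session from the native ORC reader back to the old Hive reader. It assumes a PySpark SparkSession-like object whose conf.set(key, value) behaves like PySpark's RuntimeConfig.set. The helper name use_hive_orc_reader and the override table are hypothetical, and only two of the keys listed in the message are changed; whether the other keys also need changing is not settled in this thread.

```python
# Hypothetical helper: the function name and the override table below are
# illustrative only and are not part of Spark.
HIVE_ORC_FALLBACK = {
    # Select the old Hive-based ORC reader instead of the native one.
    "spark.sql.orc.impl": "hive",
    # Turn off the vectorized ORC reader used by the native code path.
    "spark.sql.orc.enableVectorizedReader": "false",
}


def use_hive_orc_reader(spark, overrides=None):
    """Apply the fallback settings to `spark` and return what was applied.

    `spark` is assumed to expose `conf.set(key, value)`, as a PySpark
    SparkSession does.
    """
    applied = dict(HIVE_ORC_FALLBACK if overrides is None else overrides)
    for key, value in applied.items():
        spark.conf.set(key, value)
    return applied
```

In a PySpark shell this would be called as use_hive_orc_reader(spark) before re-running the read of the zlib-compressed files.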
