spark-user mailing list archives

From Dongjoon Hyun <dongj...@apache.org>
Subject Re: ORC native in Spark 2.3, with zlib, gives java.nio.BufferUnderflowException during read
Date Wed, 28 Mar 2018 01:26:29 GMT
You may be hitting SPARK-23355 (convertMetastore should not ignore table properties).

Since it is a known Spark issue that affects all Hive tables (Parquet/ORC), could you check whether it applies to your case as well?

Bests,
Dongjoon.
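
One way to check the SPARK-23355 angle is sketched below. It is a rough illustration rather than a confirmed reproduction: it assumes a spark-shell from a Spark 2.3 build with Hive support, and `zlib_props_demo` is a hypothetical table name chosen only for this example. The idea is to create a Hive ORC table that requests zlib through the `orc.compress` table property, look at what the metastore stores, and then write and read through the table with the native ORC implementation and `convertMetastoreOrc` enabled.

```scala
// Paste into spark-shell (Spark 2.3 built with Hive support), where `spark` is predefined.
// `zlib_props_demo` is a hypothetical table name used only for this sketch.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

// Create a Hive ORC table that requests zlib compression through a table property.
spark.sql("""
  CREATE TABLE IF NOT EXISTS zlib_props_demo (id BIGINT)
  STORED AS ORC
  TBLPROPERTIES ('orc.compress' = 'ZLIB')
""")

// Show which properties the metastore actually holds for the table.
spark.sql("SHOW TBLPROPERTIES zlib_props_demo").show(false)

// Write through the metastore table, then read it back with the native reader.
spark.range(10).write.mode("append").insertInto("zlib_props_demo")
spark.table("zlib_props_demo").show()
```

Running the same read against one of the affected tables, instead of the demo table, should show whether the exception depends on going through the metastore conversion path or also occurs when the files are read directly by path.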

On 2018/03/28 01:00:55, Dongjoon Hyun <dongjoon@apache.org> wrote: 
> Hi, Eric.
> 
> For me, Spark 2.3 works correctly, as shown below. Could you give us a reproducible
> example?
> 
> ```
> scala> sql("set spark.sql.orc.impl=native")
> 
> scala> sql("set spark.sql.orc.compression.codec=zlib")
> res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
> 
> scala> spark.range(10).write.orc("/tmp/zlib_test")
> 
> scala> spark.read.orc("/tmp/zlib_test").show
> +---+
> | id|
> +---+
> |  8|
> |  9|
> |  5|
> |  0|
> |  3|
> |  4|
> |  6|
> |  7|
> |  1|
> |  2|
> +---+
> 
> scala> sc.version
> res4: String = 2.3.0
> ```
> 
> Bests,
> Dongjoon.
> 
> 
> On 2018/03/23 15:03:29, Eirik Thorsnes <eirik.thorsnes@uni.no> wrote: 
> > Hi all,
> > 
> > I'm trying the new ORC native in Spark 2.3
> > (org.apache.spark.sql.execution.datasources.orc).
> > 
> > I've compiled Spark 2.3 from the git branch-2.3 as of March 20th.
> > I also get the same error for the Spark 2.2 from Hortonworks HDP 2.6.4.
> > 
> > *NOTE*: the error only occurs with zlib compression, and I see that with
> > Snappy I get an extra log-line saying "OrcCodecPool: Got brand-new codec
> > SNAPPY". Perhaps zlib codec is never loaded/triggered in the new code?
> > 
> > I can write using the new native code path without errors, but when *reading*
> > zlib-compressed ORC files, either newly written ones *or* older ones
> > written with Spark 2.2/1.6, I get the following exception.
> > 
> > ======= cut =========
> > 2018-03-23 10:36:08,249 INFO FileScanRDD: Reading File path: hdfs://.../year=1999/part-r-00000-2573bfff-1f18-47d9-b0fb-37dc216b8a99.orc, range: 0-134217728, partition values: [1999]
> > 2018-03-23 10:36:08,326 INFO ReaderImpl: Reading ORC rows from hdfs://.../year=1999/part-r-00000-2573bfff-1f18-47d9-b0fb-37dc216b8a99.orc with {include: [true, true, true, true, true, true, true, true, true], offset: 0, length: 134217728}
> > 2018-03-23 10:36:08,326 INFO RecordReaderImpl: Reader schema not provided -- using file schema struct<datetime:timestamp,lon:float,lat:float,u10:smallint,v10:smallint,lcc:smallint,mcc:smallint,hcc:smallint>
> > 
> > 2018-03-23 10:36:08,824 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> > java.nio.BufferUnderflowException
> >         at java.nio.Buffer.nextGetIndex(Buffer.java:500)
> >         at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:249)
> >         at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:248)
> >         at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:58)
> >         at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
> >         at org.apache.orc.impl.TreeReaderFactory$TimestampTreeReader.nextVector(TreeReaderFactory.java:976)
> >         at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:1815)
> >         at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1184)
> >         at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.scala:186)
> >         at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.scala:114)
> >         at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> >         at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
> >         at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
> >         at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
> >         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
> >         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> >         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> >         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> >         at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
> >         at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
> >         at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> >         at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> >         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> >         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> >         at org.apache.spark.scheduler.Task.run(Task.scala:108)
> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >         at java.lang.Thread.run(Thread.java:748)
> > ======= cut =========
> > 
> > I have the following set in spark-defaults.conf:
> > 
> > spark.sql.hive.convertMetastoreOrc true
> > spark.sql.orc.char.enabled true
> > spark.sql.orc.enabled true
> > spark.sql.orc.filterPushdown true
> > spark.sql.orc.impl native
> > spark.sql.orc.enableVectorizedReader true
> > 
> > 
> > If I set these to false and use the old Hive reader (or specify the full
> > class name of the old Hive reader in the spark-shell), I get correct results
> > with both new and old ORC files.
> > 
> > If I use Snappy compression it works with the new reader without error.
> > 
> > NOTE: I'm running on Hortonworks HDP 2.6.4 (Hadoop 2.7.3) and I also get
> > the same error for the Spark 2.2 there which I understand has many of
> > the patches from the Spark 2.3 branch.
> > 
> > Should this be reported in the JIRA system?
> > 
> > Regards,
> > Eirik
> > 
> > -- 
> > Eirik Thorsnes
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> > 
> > 
> 