spark-dev mailing list archives

From Pei-Lun Lee <pl...@appier.com>
Subject Re: Spark SQL 1.0.1 error on reading fixed length byte array
Date Mon, 04 Aug 2014 06:53:30 GMT
Hi,

We have a PR to support fixed-length byte arrays in Parquet files.

https://github.com/apache/spark/pull/1737

Can someone help verify it?

Thanks.

2014-07-15 19:23 GMT+08:00 Pei-Lun Lee <pllee@appier.com>:

> Sorry, should be SPARK-2489
>
>
> 2014-07-15 19:22 GMT+08:00 Pei-Lun Lee <pllee@appier.com>:
>
>> Filed SPARK-2446
>>
>>
>>
>> 2014-07-15 16:17 GMT+08:00 Michael Armbrust <michael@databricks.com>:
>>
>>> Oh, maybe not. Please file another JIRA.
>>>
>>>
>>> On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee <pllee@appier.com> wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> Good to know it is being handled. I tried master branch (9fe693b5) and
>>>> got another error:
>>>>
>>>> scala> sqlContext.parquetFile("/tmp/foo")
>>>> java.lang.RuntimeException: Unsupported parquet datatype optional fixed_len_byte_array(4) b
>>>> at scala.sys.package$.error(package.scala:27)
>>>> at org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:58)
>>>> at org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:109)
>>>> at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:282)
>>>> at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:279)
>>>>         ......
>>>>
>>>> The avro schema I used is something like:
>>>>
>>>> protocol Test {
>>>>     fixed Bytes4(4);
>>>>
>>>>     record User {
>>>>         string name;
>>>>         int age;
>>>>         union {null, int} i;
>>>>         union {null, int} j;
>>>>         union {null, Bytes4} b;
>>>>         union {null, bytes} c;
>>>>         union {null, int} d;
>>>>     }
>>>> }
>>>>
>>>> Is this case included in SPARK-2446
>>>> <https://issues.apache.org/jira/browse/SPARK-2446>?
>>>>
>>>>
>>>> 2014-07-15 3:54 GMT+08:00 Michael Armbrust <michael@databricks.com>:
>>>>
>>>>> This is not supported yet, but there is a PR open to fix it:
>>>>> https://issues.apache.org/jira/browse/SPARK-2446
>>>>>
>>>>>
>>>>> On Mon, Jul 14, 2014 at 4:17 AM, Pei-Lun Lee <pllee@appier.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am using spark-sql 1.0.1 to load parquet files generated by the
>>>>>> method described in:
>>>>>>
>>>>>> https://gist.github.com/massie/7224868
>>>>>>
>>>>>>
>>>>>> When I try to submit a select query with columns of type fixed length
>>>>>> byte array, the following error pops up:
>>>>>>
>>>>>>
>>>>>> 14/07/14 11:09:14 INFO scheduler.DAGScheduler: Failed to run take at basicOperators.scala:100
>>>>>> org.apache.spark.SparkDriverExecutionException: Execution error
>>>>>>         at
>>>>>> org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:581)
>>>>>>         at
>>>>>> org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:559)
>>>>>> Caused by: parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file s3n://foo/bar/part-r-00000.snappy.parquet
>>>>>>         at
>>>>>> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:177)
>>>>>>         at
>>>>>> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
>>>>>>         at
>>>>>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122)
>>>>>>         at
>>>>>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>>>>>>         at
>>>>>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>>>>>         at
>>>>>> scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>>>>>>         at
>>>>>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>>>>>         at
>>>>>> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
>>>>>>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>>>>>         at
>>>>>> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>>>>>         at
>>>>>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>>>>>         at
>>>>>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>>>>>>         at
>>>>>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>>>>>>         at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>>>>>>         at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>>>>>>         at
>>>>>> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>>>>>>         at
>>>>>> scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>>>>>>         at
>>>>>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>>>>>>         at
>>>>>> scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>>>>>>         at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:989)
>>>>>>         at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:989)
>>>>>>         at
>>>>>> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
>>>>>>         at
>>>>>> org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:574)
>>>>>>         ... 1 more
>>>>>> Caused by: java.lang.ClassCastException: Expected instance of
>>>>>> primitive converter but got
>>>>>> "org.apache.spark.sql.parquet.CatalystNativeArrayConverter"
>>>>>>         at
>>>>>> parquet.io.api.Converter.asPrimitiveConverter(Converter.java:30)
>>>>>>         at
>>>>>> parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:264)
>>>>>>         at
>>>>>> parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:60)
>>>>>>         at
>>>>>> parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)
>>>>>>         at
>>>>>> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110)
>>>>>>         at
>>>>>> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
>>>>>>         ... 24 more
>>>>>>
>>>>>>
>>>>>> Is fixed length byte array supposed to work in this version? I
>>>>>> noticed that other array types like int or string already work.
>>>>>>
>>>>>> Thanks,
>>>>>> --
>>>>>> Pei-Lun
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
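
For readers following the thread: the "Unsupported parquet datatype optional fixed_len_byte_array(4) b" error quoted above is raised from a type-conversion step (toPrimitiveDataType in the stack trace) that apparently had no case for that Parquet primitive. The sketch below only illustrates the general shape of such a mapping. Every name in it (ParquetPrimitive, SqlType, FixedLenSketch, toSqlType) is a hypothetical stand-in, not Spark's or Parquet's actual classes, and treating a fixed-length field as plain binary is an assumption, not necessarily what the PR does.

```scala
// Toy stand-ins for Parquet primitive types (hypothetical, not parquet-mr classes).
sealed trait ParquetPrimitive
case object Int32 extends ParquetPrimitive
case object Binary extends ParquetPrimitive
final case class FixedLenByteArray(length: Int) extends ParquetPrimitive
case object Int96 extends ParquetPrimitive // deliberately left unmapped below

// Toy stand-ins for SQL-side types (hypothetical, not Spark SQL classes).
sealed trait SqlType
case object IntegerType extends SqlType
case object BinaryType extends SqlType

object FixedLenSketch {
  // Map a primitive to a SQL type, or return a message for unsupported ones.
  def toSqlType(p: ParquetPrimitive): Either[String, SqlType] = p match {
    case Int32  => Right(IntegerType)
    case Binary => Right(BinaryType)
    // Assumption: a fixed-length field is exposed as ordinary binary,
    // and its declared length is not carried into the SQL type.
    case FixedLenByteArray(_) => Right(BinaryType)
    case other => Left(s"Unsupported parquet datatype $other")
  }

  def main(args: Array[String]): Unit = {
    println(toSqlType(FixedLenByteArray(4))) // the case the thread is about
    println(toSqlType(Int96))                // still reported as unsupported
  }
}
```

Returning an Either instead of calling sys.error keeps the sketch easy to exercise; the code in the stack trace throws instead.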
