spark-user mailing list archives

From Don Drake <dondr...@gmail.com>
Subject Re: Problem reading Parquet from 1.2 to 1.3
Date Sun, 07 Jun 2015 18:03:59 GMT
Thanks, Cheng. We have a workaround in place for Spark 1.3 (remove the
.metadata directory); good to know it will be resolved in 1.4.
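
For anyone hitting the same thing, the workaround amounts to deleting the
.metadata directory (it holds Avro schema files, not Parquet data) before
loading. A rough, untested spark-shell sketch; the path is from our
environment, adjust as needed:

import org.apache.hadoop.fs.{FileSystem, Path}

// Spark 1.3's parallel footer scan reads every file under the directory,
// so the Avro schema files under .metadata trip it up. Delete them first.
val parqDir = "hdfs://nameservice1/user/ddrak/parq_dir"
val fs = FileSystem.get(sc.hadoopConfiguration)
val metadata = new Path(parqDir, ".metadata")
if (fs.exists(metadata)) fs.delete(metadata, true) // recursive delete
val d = sqlContext.parquetFile(parqDir)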

-Don

On Sun, Jun 7, 2015 at 8:51 AM, Cheng Lian <lian.cs.zju@gmail.com> wrote:

>  This issue was recently fixed for Spark 1.4:
> https://github.com/apache/spark/pull/6581
>
> Cheng
>
>
> On 6/5/15 12:38 AM, Marcelo Vanzin wrote:
>
> I talked to Don outside the list and he says that he's seeing this issue
> with Apache Spark 1.3 too (not just CDH Spark), so it seems like there is a
> real issue here.
>
> On Wed, Jun 3, 2015 at 1:39 PM, Don Drake <dondrake@gmail.com> wrote:
>
>> As part of upgrading a cluster from CDH 5.3.x to CDH 5.4.x, I noticed that
>> Spark behaves differently when reading Parquet directories that contain
>> a .metadata directory.
>>
>> It seems that Spark 1.2.x would simply ignore the .metadata directory,
>> but now that I'm using Spark 1.3, reading these files causes the
>> following exceptions:
>>
>>  scala> val d = sqlContext.parquetFile("/user/ddrak/parq_dir")
>>
>> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>> SLF4J: Defaulting to no-operation (NOP) logger implementation
>> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>>
>> scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown
>> during a parallel computation:
>>
>> java.lang.RuntimeException:
>> hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schema.avsc is not a
>> Parquet file. expected magic number at tail [80, 65, 82, 49] but found
>> [116, 34, 10, 125]
>>   parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)
>>   parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)
>>   org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)
>>   org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)
>>   scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
>>   scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
>>   scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>>   scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>>   scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
>>   scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
>>   ...
>>
>> java.lang.RuntimeException:
>> hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schemas/1.avsc is not
>> a Parquet file. expected magic number at tail [80, 65, 82, 49] but found
>> [116, 34, 10, 125]
>>   [same stack frames as above]
>>   ...
>>
>> java.lang.RuntimeException:
>> hdfs://nameservice1/user/ddrak/parq_dir/.metadata/descriptor.properties
>> is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but
>> found [117, 101, 116, 10]
>>   [same stack frames as above]
>>   ...
>>
>>         at scala.collection.parallel.package$$anon$1.alongWith(package.scala:87)
>>         at scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86)
>>         at scala.collection.parallel.mutable.ParArray$Map.mergeThrowables(ParArray.scala:650)
>>         at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72)
>>         at scala.collection.parallel.mutable.ParArray$Map.tryMerge(ParArray.scala:650)
>>         at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190)
>>         at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
>>         at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
>>         at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
>>         at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
>>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
>> When I remove the .metadata directory, Spark is able to read these
>> Parquet files just fine.
>>
>> I feel that Spark should ignore dot files/directories when attempting
>> to read these Parquet files. I'm seeing this in CDH 5.4.2 (Spark 1.3.0 +
>> patches).
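>>
>> As a stop-gap that avoids deleting anything, one can also list the
>> directory and hand only the visible entries to parquetFile. A rough,
>> untested sketch, assuming 1.3's varargs parquetFile:
>>
>> import org.apache.hadoop.fs.{FileSystem, Path}
>>
>> // Keep only entries whose names do not start with "." or "_",
>> // which skips .metadata while leaving it in place on HDFS.
>> val dir = new Path("/user/ddrak/parq_dir")
>> val fs = FileSystem.get(sc.hadoopConfiguration)
>> val visible = fs.listStatus(dir)
>>   .map(_.getPath)
>>   .filter(p => !p.getName.startsWith(".") && !p.getName.startsWith("_"))
>>   .map(_.toString)
>> val d = sqlContext.parquetFile(visible: _*)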
>>
>> Thoughts?
>>
>>  --
>>  Donald Drake
>> Drake Consulting
>> http://www.drakeconsulting.com/
>> http://www.MailLaunder.com/
>> 800-733-2143
>>
>
> --
> Marcelo


-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
http://www.MailLaunder.com/
800-733-2143
