spark-user mailing list archives

From "萝卜丝炒饭" <1427357...@qq.com>
Subject Re: Spark parquet file read problem !
Date Tue, 01 Aug 2017 14:03:07 GMT
😒
I have no idea about this.


 
---Original---
From: "serkan taş"<serkan_tas@hotmail.com>
Date: 2017/7/31 16:42:59
To: "pandees waran"<pandeesh@gmail.com>;"萝卜丝炒饭"<1427357147@qq.com>;
Cc: "user@spark.apache.org"<user@spark.apache.org>;
Subject: Re: Spark parquet file read problem !


  Thank you very much.
 
 
  Schema merge fixed the structure problem, but the fields with the same name and different types are still an issue I need to work on.
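
  A minimal sketch of one way to work around the type conflict (the column name "ts" and the chosen type are hypothetical, not from the original files): read the two files separately, cast the conflicting field to one common type, and union the results:

  from pyspark.sql.functions import col

  df_a = spark.read.parquet("hdfs://xxx/20170719/part-00000-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet")
  df_b = spark.read.parquet("hdfs://xxx/20170719/part-00000-17262890-a56c-42e2-b299-2b10354da184.snappy.parquet")

  # Hypothetical conflict: "ts" is a long in one file and a string in the other.
  # Cast it to the same type in both frames before combining them.
  df_a = df_a.withColumn("ts", col("ts").cast("long"))
  df_b = df_b.withColumn("ts", col("ts").cast("long"))

  # Select the columns in the same order so the positional union lines up.
  merged = df_a.select(df_a.columns).union(df_b.select(df_a.columns))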
 
 
   Get Outlook for Android
 
 
 
 
  From: 萝卜丝炒饭
 
  Sent: Monday, July 31, 11:16
 
  Subject: Re: Spark parquet file read problem !
 
  To: serkan taş, pandees waran
 
  Cc: user@spark.apache.org
 
 
 
  Please add the mergeSchema option when reading.
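
  For example, in PySpark (using the folder path from the message below; the option is spelled mergeSchema):

  # Enable Parquet schema merging when reading the whole folder:
  parquetFile = spark.read.option("mergeSchema", "true").parquet("hdfs://xxx/20170719/")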
 
 
  ---Original---
 
  From: "serkan taş"<serkan_tas@hotmail.com>
 
  Date: 2017/7/31 13:54:14
 
  To: "pandees waran"<pandeesh@gmail.com>;
 
  Cc: "user@spark.apache.org"<user@spark.apache.org>;
 
  Subject: Re: Spark parquet file read problem !
 
 
  I checked and realised that the schemas of the files differ: some fields are missing, and some fields have the same name but different types.
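
  For reference, a quick way to see exactly how the two schemas differ is to print them per file:

  spark.read.parquet("hdfs://xxx/20170719/part-00000-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet").printSchema()
  spark.read.parquet("hdfs://xxx/20170719/part-00000-17262890-a56c-42e2-b299-2b10354da184.snappy.parquet").printSchema()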
 
 
  How can I overcome the issue?
 
 
  Get Outlook for Android
 
 
  From: pandees waran <pandeesh@gmail.com>
 
  Sent: Sunday, July 30, 2017 7:12:55 PM
 
  To: serkan taş
 
  Cc: user@spark.apache.org
 
  Subject: Re: Spark parquet file read problem ! 
 
   
 
  I have encountered a similar error when the schemas/datatypes conflict between the two parquet files. Are you sure that the two individual files have the same structure with the same datatypes? If not, you have to fix this by enforcing default values for the missing fields to make the structure and data types identical.
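
  A minimal sketch of that approach in PySpark (the column name "duration" and its type are hypothetical, chosen only for illustration):

  from pyspark.sql.functions import lit

  df_a = spark.read.parquet("hdfs://xxx/20170719/part-00000-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet")
  df_b = spark.read.parquet("hdfs://xxx/20170719/part-00000-17262890-a56c-42e2-b299-2b10354da184.snappy.parquet")

  # Hypothetical: suppose df_b is missing a long column "duration" that df_a has.
  # Add it with a default value (null here) and the matching type.
  df_b = df_b.withColumn("duration", lit(None).cast("long"))

  # With identical columns and types, the positional union is safe.
  merged = df_a.select(df_a.columns).union(df_b.select(df_a.columns))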
 
 
  Sent from my iPhone
 
 
  On Jul 30, 2017, at 8:11 AM, serkan taş <serkan_tas@hotmail.com> wrote:
 
 
   Hi, 
 
 
  I have a problem while reading parquet files located in hdfs.
 
 
 
  If I read the files individually, nothing goes wrong and I can get the file content.
 
 
  parquetFile = spark.read.parquet("hdfs://xxx/20170719/part-00000-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet")
 
 
  and
 
 
  parquetFile = spark.read.parquet("hdfs://xxx/20170719/part-00000-17262890-a56c-42e2-b299-2b10354da184.snappy.parquet")
 
 
 
  But when I try to read the whole folder like this:
 
 
  parquetFile = spark.read.parquet("hdfs://xxx/20170719/")
 
 
 
  then I get the exception below:
 
 
  Note: Only these two files are in the folder. Please find the parquet files attached.
 
 
 
  Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
  : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 5, localhost, executor driver): org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file hdfs://xxx/20170719/part-00000-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet
         at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
         at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
         at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
         at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
         at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:166)
         at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
         at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
         at scala.collection.Iterator$class.foreach(Iterator.scala:893)
         at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
         at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
         at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
         at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
  Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong
         at org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.setLong(SpecificInternalRow.scala:295)
         at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setLong(ParquetRowConverter.scala:164)
         at org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addLong(ParquetRowConverter.scala:86)
         at org.apache.parquet.column.impl.ColumnReaderImpl$2$4.writeValue(ColumnReaderImpl.java:274)
         at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:371)
         at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
         at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
         ... 16 more
 
  Driver stacktrace:
         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
         at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
         at scala.Option.foreach(Option.scala:257)
         at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
         at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
         at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
         at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
         at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
         at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
         at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:441)
         at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
         at java.lang.reflect.Method.invoke(Method.java:498)
         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
         at py4j.Gateway.invoke(Gateway.java:280)
         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
         at py4j.commands.CallCommand.execute(CallCommand.java:79)
         at py4j.GatewayConnection.run(GatewayConnection.java:214)
         at java.lang.Thread.run(Thread.java:745)
  Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file hdfs://xxx/20170719/part-00000-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet
         at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
         at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
         at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
         at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
         at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:166)
         at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
         at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
         at scala.collection.Iterator$class.foreach(Iterator.scala:893)
         at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
         at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
         at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
         at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
  Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong
         at org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.setLong(SpecificInternalRow.scala:295)
         at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setLong(ParquetRowConverter.scala:164)
         at org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addLong(ParquetRowConverter.scala:86)
         at org.apache.parquet.column.impl.ColumnReaderImpl$2$4.writeValue(ColumnReaderImpl.java:274)
         at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:371)
         at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
         at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
         ... 16 more
 
 
 
   
 
   <part-00000-17262890-a56c-42e2-b299-2b10354da184.snappy.parquet>
 
    <part-00000-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet>
 
    
 
  ---------------------------------------------------------------------
 
  To unsubscribe e-mail: user-unsubscribe@spark.apache.org