spark-user mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: Error in using saveAsParquetFile
Date Tue, 09 Jun 2015 07:38:50 GMT
Yeah, this does look confusing. We are trying to improve error 
reporting by catching similar issues at the end of the analysis phase 
and giving more descriptive error messages. Part of the work can be 
found here: 
https://github.com/apache/spark/blob/0902a11940e550e85a53e110b490fe90e16ddaf4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
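
For illustration only, here is a minimal user-level sketch of the kind 
of check this adds: verify that a shared column (such as the PolicyType 
below) resolves to the same type on both sides before the plan runs, 
and fail with a readable message. The helper name and error text are 
mine, not Spark's:

    import org.apache.spark.sql.DataFrame

    // Hypothetical helper: fail early if the same column has different
    // types in the two DataFrames about to be joined.
    def checkColumnTypes(left: DataFrame, right: DataFrame, column: String): Unit = {
      val lt = left.schema(column).dataType
      val rt = right.schema(column).dataType
      if (lt != rt) {
        sys.error(s"Column '$column' is $lt on one side but $rt on the other; " +
          "cast one side before joining.")
      }
    }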

Cheng

On 6/9/15 2:51 PM, Bipin Nag wrote:
> Cheng, you were right. It works when I remove the field from either 
> one. I should have checked the types beforehand. What confused me is 
> that Spark attempted the join and only threw the error midway; it 
> isn't quite there yet. Thanks for the help.
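>
> For what it's worth, a hedged alternative to dropping the field is to 
> cast one side to a common type before the join (the column name is 
> taken from the stack trace below; this assumes Spark 1.4+, where 
> withColumn replaces an existing column of the same name):
>
>     val fixed = Bookings.withColumn("PolicyType",
>       Bookings("PolicyType").cast("string"))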
>
> On Mon, Jun 8, 2015 at 8:29 PM Cheng Lian <lian.cs.zju@gmail.com 
> <mailto:lian.cs.zju@gmail.com>> wrote:
>
>     I suspect that Bookings and Customerdetails both have a PolicyType
>     field, where one is a string and the other is an int.
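>
>     A quick way to confirm is to print both schemas and compare the
>     PolicyType entries (printSchema is standard DataFrame API):
>
>         Bookings.printSchema()        // e.g. PolicyType: integer
>         Customerdetails.printSchema() // e.g. PolicyType: string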
>
>
>     Cheng
>
>
>     On 6/8/15 9:15 PM, Bipin Nag wrote:
>>     Hi Jeetendra, Cheng
>>
>>     I am using the following code for the join:
>>
>>     val Bookings = sqlContext.load("/home/administrator/stageddata/Bookings")
>>     val Customerdetails = sqlContext.load("/home/administrator/stageddata/Customerdetails")
>>
>>     // Customers created in April 2015
>>     val CD = Customerdetails.
>>         where($"CreatedOn" > "2015-04-01 00:00:00.0").
>>         where($"CreatedOn" < "2015-05-01 00:00:00.0")
>>
>>     // Bookings joined to CD; rename the key to avoid a duplicate column name
>>     val r1 = Bookings.withColumnRenamed("ID", "ID2")
>>     val r2 = CD.join(r1, CD.col("CustomerID") === r1.col("ID2"), "left")
>>
>>     r2.saveAsParquetFile("/home/administrator/stageddata/BOOKING_FULL")
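>>
>>     As a sanity check before the save, the column types on both sides
>>     can be compared directly (dtypes is standard DataFrame API; the
>>     column name is taken from the error below):
>>
>>     println(Bookings.dtypes.toMap.get("PolicyType"))
>>     println(Customerdetails.dtypes.toMap.get("PolicyType"))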
>>
>>     @Cheng I am not appending the joined table to an existing Parquet
>>     file; it is a new file.
>>     @Jeetendra I have a rather large Parquet file and it also contains
>>     some confidential data. Can you tell me what you need to check in it?
>>
>>     Thanks
>>
>>
>>     On 8 June 2015 at 16:47, Jeetendra Gangele <gangele397@gmail.com
>>     <mailto:gangele397@gmail.com>> wrote:
>>
>>         When are you loading these Parquet files?
>>         Can you please share the code where you pass the Parquet
>>         files to Spark?
>>
>>         On 8 June 2015 at 16:39, Cheng Lian <lian.cs.zju@gmail.com
>>         <mailto:lian.cs.zju@gmail.com>> wrote:
>>
>>             Are you appending the joined DataFrame whose PolicyType
>>             is string to an existing Parquet file whose PolicyType is
>>             int? The exception indicates that Parquet found a column
>>             with conflicting data types.
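>>
>>             If in doubt, one way to see what a Parquet file already
>>             contains is to read it back and print the schema (the
>>             path here is only a placeholder):
>>
>>                 sqlContext.parquetFile("/path/to/existing").printSchema()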
>>
>>             Cheng
>>
>>
>>             On 6/8/15 5:29 PM, bipin wrote:
>>
>>                 Hi, I get this error message when saving a table:
>>
>>                 parquet.io.ParquetDecodingException: The requested
>>                 schema is not compatible with the file schema.
>>                 incompatible types: optional binary PolicyType (UTF8)
>>                 != optional int32 PolicyType
>>                         at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:105)
>>                         at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:97)
>>                         at parquet.schema.PrimitiveType.accept(PrimitiveType.java:386)
>>                         at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:87)
>>                         at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:61)
>>                         at parquet.schema.MessageType.accept(MessageType.java:55)
>>                         at parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:148)
>>                         at parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:137)
>>                         at parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:157)
>>                         at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:107)
>>                         at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
>>                         at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
>>                         at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
>>                         at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
>>                         at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:667)
>>                         at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
>>                         at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
>>                         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>>                         at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>                         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>>                         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>                         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>                         at java.lang.Thread.run(Thread.java:745)
>>
>>                 I joined two tables, both loaded from Parquet files;
>>                 the joined table throws this error when saved. I could
>>                 not find anything about this error. Could this be a bug?
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         Regards,
>>         Jeetendra
>>
>>
>

