spark-user mailing list archives

From Bipin Nag <bipin....@gmail.com>
Subject Re: Error in using saveAsParquetFile
Date Mon, 08 Jun 2015 13:15:11 GMT
Hi Jeetendra, Cheng

I am using the following code for the join:

// implicits bring in the $"column" syntax used below
import sqlContext.implicits._

val Bookings = sqlContext.load("/home/administrator/stageddata/Bookings")
val Customerdetails = sqlContext.load("/home/administrator/stageddata/Customerdetails")

// Customers created in April 2015
val CD = Customerdetails.
    where($"CreatedOn" > "2015-04-01 00:00:00.0").
    where($"CreatedOn" < "2015-05-01 00:00:00.0")

// Bookings by CD: rename Bookings.ID to avoid a duplicate column name,
// then left-join on CustomerID
val r1 = Bookings.
    withColumnRenamed("ID", "ID2")
val r2 = CD.
    join(r1, CD.col("CustomerID") === r1.col("ID2"), "left")

r2.saveAsParquetFile("/home/administrator/stageddata/BOOKING_FULL")
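
In case it is useful, a quick check I can run first (a minimal sketch; it only
assumes PolicyType, the column named in the exception, is present in one of
these inputs) is to print the schemas and compare the inferred type of
PolicyType on each side:

// Print the inferred schema of each input and of the joined result;
// look for PolicyType read as int32 in one place and string (UTF8) in another.
Bookings.printSchema()
Customerdetails.printSchema()
r2.printSchema()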

@Cheng I am not appending the joined table to an existing parquet file; it
is a new file.
@Jeetendra I have a rather large parquet file, and it also contains some
confidential data. Can you tell me what you need to check in it?
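
If the mismatch really is between an int and a string reading of PolicyType,
one workaround sketch (the column name and target type are taken from the
exception message only, so adjust to the real data) is to cast the column to
a single type before writing:

import org.apache.spark.sql.functions.col

// Force PolicyType to a single type so the written schema is consistent.
// Assumes r2 actually carries a PolicyType column.
val r2Fixed = r2.withColumn("PolicyType", col("PolicyType").cast("string"))
r2Fixed.saveAsParquetFile("/home/administrator/stageddata/BOOKING_FULL")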

Thanks


On 8 June 2015 at 16:47, Jeetendra Gangele <gangele397@gmail.com> wrote:

> When are you loading these Parquet files?
> Can you please share the code where you pass the Parquet files to Spark?
>
> On 8 June 2015 at 16:39, Cheng Lian <lian.cs.zju@gmail.com> wrote:
>
>> Are you appending the joined DataFrame whose PolicyType is string to an
>> existing Parquet file whose PolicyType is int? The exception indicates that
>> Parquet found a column with conflicting data types.
>>
>> Cheng
>>
>>
>> On 6/8/15 5:29 PM, bipin wrote:
>>
>>> Hi, I get this error message when saving a table:
>>>
>>> parquet.io.ParquetDecodingException: The requested schema is not compatible
>>> with the file schema. incompatible types: optional binary PolicyType (UTF8)
>>> != optional int32 PolicyType
>>>         at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:105)
>>>         at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:97)
>>>         at parquet.schema.PrimitiveType.accept(PrimitiveType.java:386)
>>>         at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:87)
>>>         at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:61)
>>>         at parquet.schema.MessageType.accept(MessageType.java:55)
>>>         at parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:148)
>>>         at parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:137)
>>>         at parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:157)
>>>         at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:107)
>>>         at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
>>>         at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
>>>         at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
>>>         at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
>>>         at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:667)
>>>         at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
>>>         at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
>>>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>>>         at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>         at java.lang.Thread.run(Thread.java:745)
>>>
>>> I joined two tables, both loaded from Parquet files; saving the joined
>>> table throws this error. I could not find anything about it. Could this
>>> be a bug?
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Error-in-using-saveAsParquetFile-tp23204.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>>
>>
>>
>
>
> --
> Regards
> Jeetendra
>
