spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hamish Whittal <ham...@cloud-fundis.co.za>
Subject Still incompatible schemas
Date Mon, 09 Mar 2020 07:57:04 GMT
Hi folks,

Thanks for the help thus far.

I'm trying to track down the source of this error:

java.lang.UnsupportedOperationException:
org.apache.parquet.column.values.dictionary.PlainValuesDictionary

w hen doing a message.show()

Basically I'm reading in a single Parquet file (to try to narrow things
down).

I'm defining the schema in the beginning and loading the parquet with:
   message = spark\
             .read\
             .schema(myMessageSchema)\
             .format("parquet")\
             .option("mergeSchema", "true")\
             .option("badRecordsPath", "/tmp/badRecords/")\

 .load("hdfs:///user/hadoop/feb20/part-00000-c6da95c9-9c40-4623-a5c5-851188e236ff-c000.snappy.parquet")

[I've tried with and without the mergeSchema option]
[ sidenote: I was hoping the badRecordPath would help with the truly bad
records, but this seems to do nothing]

I've also tried to cast the potential problematic columns (so Int, Long,
Double, etc) with

  message_1 = message\
    .withColumn('price', col('price').cast('double'))\
    .withColumn('price_eur', col('price_eur').cast('double'))\
    .withColumn('cost_usd', col('cost_usd').cast('double'))\
    .withColumn('adapter_status', col('adapter_status').cast('long'))

Yet I get this error and I can't figure out:
(a) whether it's some record WITHIN the parquet file that's causing it and
(b) if it is a single record (or a few records) then how do I find those
particular records?

In the previous time I encountered this, there were records that should
have had doubles in them (like "price" above) that actually seemed to have
null.

I did this to fix that particular problem:

if not 'price' in message.columns:
    message = message.withColumn('price', message.lit('0'))

Any suggestions or help would be MOST welcome. I have also tried using
pyarrow to take a look at the Parquet schema and it looks fine. I mean, it
doesn't look like the schema in the parquet is the problem - but of course
I'm not ruling that out just yet.

Thanks for any suggestions,

Hamish
-- 
Cloud-Fundis.co.za
Cape Town, South Africa
+27 79 614 4913

Mime
View raw message