spark-user mailing list archives

From Hamish Whittal <>
Subject Still incompatible schemas
Date Mon, 09 Mar 2020 07:57:04 GMT
Hi folks,

Thanks for the help thus far.

I'm trying to track down the source of this error:


when doing a

Basically I'm reading in a single Parquet file (to try to narrow things down).

I'm defining the schema up front and loading the parquet with:
   message = spark.read\
             .schema(schema)\
             .option("mergeSchema", "true")\
             .option("badRecordsPath", "/tmp/badRecords/")\
             .parquet(parquet_path)


[I've tried with and without the mergeSchema option.]
[Sidenote: I was hoping badRecordsPath would help with the truly bad
records, but it seems to do nothing.]

I've also tried casting the potentially problematic columns (Int, Long,
Double, etc.) with:

  from pyspark.sql.functions import col

  message_1 = message\
    .withColumn('price', col('price').cast('double'))\
    .withColumn('price_eur', col('price_eur').cast('double'))\
    .withColumn('cost_usd', col('cost_usd').cast('double'))\
    .withColumn('adapter_status', col('adapter_status').cast('long'))

Yet I still get the error, and I can't figure out:
(a) whether it's some record WITHIN the parquet file that's causing it and
(b) if it is a single record (or a few records) then how do I find those
particular records?

The previous time I encountered this, there were records that should
have had doubles in them (like "price" above) that actually seemed to have

I did this to fix that particular problem:

from pyspark.sql.functions import lit

if 'price' not in message.columns:
    # lit() is a function from pyspark.sql.functions, not a DataFrame
    # method; a double literal keeps the column type matching the schema
    message = message.withColumn('price', lit(0.0))

Any suggestions or help would be MOST welcome. I have also tried using
pyarrow to take a look at the Parquet schema and it looks fine. I mean, it
doesn't look like the schema in the parquet is the problem - but of course
I'm not ruling that out just yet.

Thanks for any suggestions,

Cape Town, South Africa
+27 79 614 4913
