spark-user mailing list archives

From Hamish Whittal <ham...@cloud-fundis.co.za>
Subject [No Subject]
Date Sun, 01 Mar 2020 21:56:56 GMT
Hi there,

I have an hdfs directory with thousands of files. It seems that some of
them - and I don't know which ones - have a problem with their schema and
it's causing my Spark application to fail with this error:

Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://ip-172-24-89-229.blaah.com:8020/user/hadoop/origdata/part-00000-8b83989a-e387-4f64-8ac5-22b16770095e-c000.snappy.parquet.
Column: [price], Expected: double, Found: FIXED_LEN_BYTE_ARRAY

The problem is not only that it's causing the application to fail, but that
every time it does fail, I have to copy that file out of the directory and
start the app again.

I thought of trying to use try-except, but I can't seem to get that to work.
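For reference, this is roughly the shape of what I was attempting (a
simplified sketch, not my actual app - the host and directory are the ones
from the error above, everything else is just illustrative). The idea is to
apply the schema the directory-level read settles on to each part file
individually, force an action, and note which files throw:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("find-bad-parquet").getOrCreate()

base = "hdfs://ip-172-24-89-229.blaah.com:8020/user/hadoop/origdata"

# The schema the directory-level read settles on (price: double in my case).
expected_schema = spark.read.parquet(base).schema

# List the individual part files via the Hadoop FileSystem API.
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = Path(base).getFileSystem(spark._jsc.hadoopConfiguration())
statuses = fs.listStatus(Path(base))

bad_files = []
for status in statuses:
    path = status.getPath().toString()
    if not path.endswith(".parquet"):
        continue
    try:
        # Reading a single file on its own doesn't fail (its schema comes
        # from its own footer), so force the expected schema and an action;
        # only then does the "cannot be converted" error surface per file.
        spark.read.schema(expected_schema).parquet(path).foreach(lambda _: None)
    except Exception:
        bad_files.append(path)

print("\n".join(bad_files))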

Is there any advice anyone can give me? I really can't see myself going
through thousands of files trying to figure out which ones are broken.

Thanks in advance,

hamish
