spark-user mailing list archives

From Zahid Rahman <zahidr1...@gmail.com>
Subject Re: Still incompatible schemas
Date Mon, 09 Mar 2020 09:42:33 GMT
This issue has been discussed and resolved on this page:

https://issues.apache.org/jira/browse/SPARK-17557


One person there suggests that simply reading the parquet file in a
different way, as illustrated below, may make the error go away. It appears
to me that you are reading the parquet file from the command line; perhaps
if you try it programmatically, as suggested, you may find a resolution.
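
In PySpark, that programmatic read might look something like this (a
minimal sketch, not tested against your data; the path is copied from your
mail below, and printSchema() is there only to compare what Spark infers
from the file against myMessageSchema):

    # Let Spark infer the schema from the parquet footer instead of
    # forcing myMessageSchema onto the scan.
    df = spark.read.parquet(
        "hdfs:///user/hadoop/feb20/part-00000-c6da95c9-9c40-4623-a5c5-851188e236ff-c000.snappy.parquet")
    df.printSchema()    # look for a column whose type differs from your schema
    df.show(3, False)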

*"** I encounter an issue when data resides in Hive as parquet format and
when trying to read from Spark (2.2.1), facing the above issue. I notice
that in my case there is date field (contains values as 2018, 2017) which
is written as integer. But when reading in spark as -*

*val df = spark.sql("SELECT * FROM db.table") *



*df.show(3, false) java.lang.UnsupportedOperationException:
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)*




*To my surprise when reading same data from s3 location as - val df =
spark.read.parquet("s3://path/file") df.show(3, false) // this displays the
results. "*
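
As for tracking down the particular records (your question (b) below): note
that casting with withColumn after the read cannot help here, because with a
user-supplied schema the exception is thrown inside the parquet scan itself,
before any of your casts run. But if the file reads cleanly once Spark infers
the schema, one way to hunt for values that will not fit the declared type is
to cast and keep the rows where the cast comes back null. A sketch, continuing
from the df above and assuming 'price' arrives as a string or other
non-double type in the inferred schema:

    from pyspark.sql.functions import col

    # A value that exists but cannot be cast to double becomes null,
    # so these are candidate offending rows.
    suspect = df.filter(col('price').isNotNull() &
                        col('price').cast('double').isNull())
    suspect.show(5, False)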


Backbutton.co.uk
¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}
makeuse.org


On Mon, 9 Mar 2020 at 07:57, Hamish Whittal <hamish@cloud-fundis.co.za>
wrote:

> Hi folks,
>
> Thanks for the help thus far.
>
> I'm trying to track down the source of this error:
>
> java.lang.UnsupportedOperationException:
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary
>
> when doing a message.show()
>
> Basically I'm reading in a single Parquet file (to try to narrow things
> down).
>
> I'm defining the schema in the beginning and loading the parquet with:
>     message = spark\
>               .read\
>               .schema(myMessageSchema)\
>               .format("parquet")\
>               .option("mergeSchema", "true")\
>               .option("badRecordsPath", "/tmp/badRecords/")\
>               .load("hdfs:///user/hadoop/feb20/part-00000-c6da95c9-9c40-4623-a5c5-851188e236ff-c000.snappy.parquet")
>
> [I've tried with and without the mergeSchema option]
> [Sidenote: I was hoping the badRecordsPath option would help with the truly
> bad records, but it seems to do nothing.]
>
> I've also tried casting the potentially problematic columns (Int, Long,
> Double, etc.) with:
>
>   message_1 = message\
>     .withColumn('price', col('price').cast('double'))\
>     .withColumn('price_eur', col('price_eur').cast('double'))\
>     .withColumn('cost_usd', col('cost_usd').cast('double'))\
>     .withColumn('adapter_status', col('adapter_status').cast('long'))
>
> Yet I get this error and I can't figure out:
> (a) whether it's some record WITHIN the parquet file that's causing it and
> (b) if it is a single record (or a few records) then how do I find those
> particular records?
>
> The previous time I encountered this, there were records that should have
> had doubles in them (like "price" above) but actually seemed to contain
> nulls.
>
> I did this to fix that particular problem:
>
>     # lit() lives in pyspark.sql.functions (not on the DataFrame), and
>     # since price is a double the literal should be numeric:
>     from pyspark.sql.functions import lit
>
>     if 'price' not in message.columns:
>         message = message.withColumn('price', lit(0.0))
>
> Any suggestions or help would be MOST welcome. I have also tried using
> pyarrow to take a look at the Parquet schema and it looks fine. I mean, it
> doesn't look like the schema in the parquet is the problem - but of course
> I'm not ruling that out just yet.
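>
> (Roughly, that pyarrow check was along these lines -- a sketch, assuming a
> local copy of the file:
>
>     import pyarrow.parquet as pq
>
>     pf = pq.ParquetFile(
>         "part-00000-c6da95c9-9c40-4623-a5c5-851188e236ff-c000.snappy.parquet")
>     print(pf.schema)        # physical parquet types, e.g. INT32 vs INT64
>     print(pf.schema_arrow)  # the logical types as Arrow reads them
> )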
>
> Thanks for any suggestions,
>
> Hamish
> --
> Cloud-Fundis.co.za
> Cape Town, South Africa
> +27 79 614 4913
>
