spark-dev mailing list archives

From: Michael Armbrust <mich...@databricks.com>
Subject: Change when loading/storing String data using Parquet
Date: Mon, 14 Jul 2014 22:55:59 GMT
I just wanted to send out a quick note about a change in the handling of
strings when loading/storing data using Parquet and Spark SQL.  Previously,
Spark SQL did not support binary data in Parquet, so all binary blobs were
implicitly treated as strings.  Commit 9fe693
<https://github.com/apache/spark/commit/9fe693b5b6ed6af34ee1e800ab89c8a11991ea38>
fixes this limitation by adding support for binary data.

However, data written out with a prior version of Spark SQL will be missing
the annotation telling us to interpret a given column as a String, so old
string data will now be loaded as binary data.  If you would like to use
the data as a string, you will need to add a CAST to convert the datatype.
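
For example, something along these lines should work from the spark-shell.
This is only a rough sketch: the Parquet path, the table name, and the `name`
column below are placeholders, so substitute the names from your own data.

    // Rough sketch, assuming a spark-shell session with `sc` available and a
    // Parquet file written by an older Spark SQL version. The path
    // "old_strings.parquet" and the column `name` are hypothetical.
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // The old file has no string annotation, so `name` loads as binary.
    val oldData = sqlContext.parquetFile("old_strings.parquet")
    oldData.registerAsTable("old_strings")

    // CAST the binary column back to a string when querying.
    val fixed = sqlContext.sql("SELECT CAST(name AS STRING) AS name FROM old_strings")
    fixed.collect().foreach(println)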

New string data written out after this change will be loaded correctly as
strings, since we now include an annotation describing the desired type.
Additionally, this should interoperate correctly with other systems that
write Parquet data (Hive, Thrift, etc.).
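
As a quick sanity check, a round trip with the new version should show the
string type directly. Again just a sketch: the Person case class, the sample
rows, and the path are made up, and it assumes the same session as above.

    // Sketch only: Person, the sample rows, and the path are hypothetical.
    // Assumes the same spark-shell session with `sc` and `sqlContext` as above.
    import sqlContext.createSchemaRDD

    case class Person(name: String, age: Int)
    val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))

    // Data written after this change carries the string annotation on `name`...
    people.saveAsParquetFile("new_people.parquet")

    // ...so reading it back yields a string column directly, no CAST required.
    val reloaded = sqlContext.parquetFile("new_people.parquet")
    reloaded.printSchema()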

Michael
