spark-user mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive
Date Fri, 25 Sep 2015 21:42:55 GMT
Please set the SQL option spark.sql.parquet.binaryAsString to true 
when reading Parquet files containing strings generated by Hive.
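
For example, in the spark-shell you can set it on the HiveContext before 
loading the file. A minimal sketch (the path is the one from your mail):

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
// Tell the Parquet reader to interpret un-annotated binary columns as strings
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
// or equivalently: sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")

val v_event_cnt = sqlContext.parquetFile("/hdfs_location/dt=2015-09-23")
v_event_cnt.printSchema()   // column2 should now show up as string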

This is actually a bug in parquet-hive. When generating the Parquet schema 
for a string field, Parquet requires a "UTF8" annotation, something like:

message hive_schema {
   ...
   optional binary column2 (UTF8);
   ...
}

but parquet-hive fails to add it, and produces:

message hive_schema {
   ...
   optional binary column2;
   ...
}

Thus binary fields and string fields are made indistinguishable.
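
That's also why Spark conservatively maps such columns to BinaryType unless 
the option above is set. If you'd rather not flip the global option, a 
per-column workaround (the cast you already mentioned) looks roughly like:

import org.apache.spark.sql.types.StringType

// Cast the un-annotated binary column back to string before using it
val fixed = v_event_cnt.withColumn("column2", v_event_cnt("column2").cast(StringType))
fixed.select("column2").show()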

Interestingly, there's another bug in parquet-thrift, which always adds the 
UTF8 annotation to all binary fields :)

Cheng

On 9/25/15 2:03 PM, java8964 wrote:
> Hi, Spark Users:
>
> I have a problem where Spark cannot recognize the string type in 
> the Parquet schema generated by Hive.
>
> Versions of all components:
>
> Spark 1.3.1
> Hive 0.12.0
> Parquet 1.3.2
>
> I generated a detailed low-level table in Parquet format using 
> MapReduce Java code. This table can be read in both Hive and Spark 
> without any issue.
>
> Now I create a Hive aggregation table like the following:
>
> create external table T (
>     column1 bigint,
> *    column2 string,*
>     ..............
> )
> partitioned by (dt string)
> ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
> STORED AS
> INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
> OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
> location '/hdfs_location'
>
> Then the table is populated in Hive by:
>
> set hive.exec.compress.output=true;
> set parquet.compression=snappy;
>
> insert into table T partition(dt='2015-09-23')
> select
>     .............
> from Detail_Table
> group by
>
> After this, we can query table T in Hive without issue.
>
> But if I try to use it in Spark 1.3.1 like the following:
>
> import org.apache.spark.sql.SQLContext
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> val v_event_cnt=sqlContext.parquetFile("/hdfs_location/dt=2015-09-23")
>
> scala> v_event_cnt.printSchema
> root
>  |-- column1: long (nullable = true)
> * |-- column2: binary (nullable = true)*
>  |-- ............
>  |-- dt: string (nullable = true)
>
> Spark recognizes column2 as binary type instead of string type in 
> this case, but in Hive it works fine.
> This causes an issue: in Spark, the data is dumped as 
> "[B@e353d68". To use it in Spark, I have to cast it to string to 
> get the correct value out of it.
>
> I wonder which part causes this type mismatch in the Parquet file. 
> Is Hive not generating the correct Parquet schema, or is Spark in 
> fact unable to recognize it due to a problem on its side?
>
> Is there a way to configure either Hive or Spark so that this Parquet 
> schema is handled correctly on both ends?
>
> Thanks
>
> Yong

