spark-user mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive
Date Fri, 25 Sep 2015 21:48:38 GMT
BTW, I just checked that this bug has been fixed since Hive 
0.14.0, so the SQL option I mentioned is mainly needed for reading legacy 
Parquet files generated by older versions of Hive.
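
For example, in spark-shell on Spark 1.3.x the option can be set before 
loading the file. A minimal sketch, reusing the path from Yong's message 
(sc is the SparkContext that spark-shell provides):

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
// Interpret un-annotated Parquet binary columns as UTF-8 strings
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

val v_event_cnt = sqlContext.parquetFile("/hdfs_location/dt=2015-09-23")
v_event_cnt.printSchema()  // column2 should now be reported as string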

Cheng

On 9/25/15 2:42 PM, Cheng Lian wrote:
> Please set the SQL option spark.sql.parquet.binaryAsString to true 
> when reading Parquet files containing strings generated by Hive.
>
> This is actually a bug in parquet-hive. When generating the Parquet 
> schema for a string field, Parquet requires a "UTF8" annotation, something like:
>
> message hive_schema {
>   ...
>   optional binary column2 (UTF8);
>   ...
> }
>
> but parquet-hive fails to add it, and produces:
>
> message hive_schema {
>   ...
>   optional binary column2;
>   ...
> }
>
> Thus binary fields and string fields are made indistinguishable.
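>
> To check which variant a given file has, the parquet-tools CLI can 
> print the file schema (a sketch; assuming parquet-tools is installed, 
> with a placeholder path):
>
> parquet-tools schema /path/to/part-00000.parquet
>
> A correctly written string column shows the (UTF8) annotation in the 
> output.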
>
> Interestingly, there's another bug in parquet-thrift, which always 
> adds the UTF8 annotation to all binary fields :)
>
> Cheng
>
> On 9/25/15 2:03 PM, java8964 wrote:
>> Hi, Spark Users:
>>
>> I have a problem where Spark cannot recognize the string type in 
>> the Parquet schema generated by Hive.
>>
>> Version of all components:
>>
>> Spark 1.3.1
>> Hive 0.12.0
>> Parquet 1.3.2
>>
>> I generated a detailed low-level table in Parquet format using 
>> MapReduce Java code. This table can be read in both Hive and Spark 
>> without any issue.
>>
>> Now I create a Hive aggregation table like the following:
>>
>> create external table T (
>>     column1 bigint,
>> *    column2 string,*
>>     ..............
>> )
>> partitioned by (dt string)
>> ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
>> STORED AS
>> INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
>> OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
>> location '/hdfs_location'
>>
>> Then the table is populated in Hive by:
>>
>> set hive.exec.compress.output=true;
>> set parquet.compression=snappy;
>>
>> insert into table T partition(dt='2015-09-23')
>> select
>>     .............
>> from Detail_Table
>> group by
>>
>> After this, we can query table T in Hive without issue.
>>
>> But if I try to use it in Spark 1.3.1 like the following:
>>
>> import org.apache.spark.sql.SQLContext
>> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> val v_event_cnt = sqlContext.parquetFile("/hdfs_location/dt=2015-09-23")
>>
>> scala> v_event_cnt.printSchema
>> root
>>  |-- column1: long (nullable = true)
>> * |-- column2: binary (nullable = true)*
>>  |-- ............
>>  |-- dt: string (nullable = true)
>>
>> Spark recognizes column2 as binary type instead of string type in 
>> this case, but in Hive it works fine.
>> This raises an issue: in Spark the data is dumped as 
>> "[B@e353d68". To use it in Spark, I have to cast it to string to 
>> get the correct value out of it.
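>>
>> For example, something like this in the shell (a simplified sketch; 
>> the alias name is just illustrative):
>>
>> val fixed = v_event_cnt.selectExpr("column1", "cast(column2 as string) as column2_str")
>> fixed.select("column2_str").show()  // prints readable text instead of [B@...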
>>
>> I wonder which part causes this type mismatch in the Parquet file. 
>> Does Hive not generate the correct Parquet schema, or does Spark in 
>> fact fail to recognize it due to a problem on its side?
>>
>> Is there a way to make either Hive or Spark handle this Parquet 
>> schema correctly on both ends?
>>
>> Thanks
>>
>> Yong
>

