spark-user mailing list archives

From java8964 <>
Subject Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive
Date Fri, 25 Sep 2015 21:03:28 GMT
Hi, Spark Users:
I have a problem where Spark cannot recognize the string type in the Parquet schema generated
by Hive.
Version of all components:
Spark 1.3.1
Hive 0.12.0
Parquet 1.3.2
I generated a detailed, low-level table in Parquet format using MapReduce Java code. This
table can be read in both Hive and Spark without any issue.
Now I create a Hive aggregation table as follows:
create external table T (
    column1 bigint,
    column2 string,
    ..............
)
partitioned by (dt string)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
location '/hdfs_location'
Then the table is populated in Hive by:
set hive.exec.compress.output=true;
set parquet.compression=snappy;
insert into table T partition(dt='2015-09-23')
select     .............
from Detail_Table
group
After this, we can query table T in Hive without issue.
But if I try to use it in Spark 1.3.1 as follows:
import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val
scala> v_event_cnt.printSchema
root
 |-- column1: long (nullable = true)
 |-- column2: binary (nullable = true)
 |-- ............
 |-- dt: string (nullable = true)
Spark recognizes column2 as binary type instead of string type in this case, whereas in
Hive it works fine. This causes a problem: in Spark, the data is dumped
as "[B@e353d68". To use it in Spark, I have to cast it to string to get the correct value
out of it.
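To illustrate the symptom, here is a minimal plain-Scala sketch (no Spark involved) of why a binary value prints as something like "[B@e353d68" and how decoding it recovers the text. It assumes the column holds UTF-8 encoded text; the object name, the helper `decode`, and the sample value are hypothetical and only stand in for one cell of column2.

```scala
import java.nio.charset.StandardCharsets

object BinaryVsString {
  // Decode raw bytes as UTF-8 text (assumption: the column really holds UTF-8 strings).
  def decode(bytes: Array[Byte]): String =
    new String(bytes, StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit = {
    // Hypothetical value standing in for one cell of column2.
    val raw: Array[Byte] = "some value".getBytes(StandardCharsets.UTF_8)

    // A byte array has no readable toString: the JVM prints "[B@" plus a hex hash code.
    println(raw.toString)

    // Decoding the bytes gives back the original text.
    println(decode(raw))
  }
}
```

Casting the column to string in the query (for example `cast(column2 as string)`) is the Spark SQL counterpart of the `decode` step above.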
I wonder which part causes this type mismatch in the Parquet file. Is Hive not
generating the Parquet file with the correct schema, or is Spark unable to recognize it due to
a problem on its side?
Is there anything I can do in either Hive or Spark to make this Parquet schema work correctly on both