spark-user mailing list archives

From Shmuel Blitz <shmuel.bl...@similarweb.com>
Subject Spark 2.1 table loaded from Hive Metastore has null values
Date Mon, 07 Aug 2017 10:05:16 GMT
(Also asked on SO at https://stackoverflow.com/q/45543140/416300)
I am trying to migrate table definitions from one Hive metastore to another.

The source cluster has:

   - Spark 1.6.0
   - Hive 1.1.0 (cdh)
   - HDFS


The destination cluster is an EMR cluster with:

   - Spark 2.1.1
   - Hive 2.1.1
   - S3


To migrate the tables I did the following:
  1. Copy data from HDFS to S3
  2. Run SHOW CREATE TABLE my_table; in the source cluster
  3. Modify the returned create query - change LOCATION from the HDFS path
to the S3 path
  4. Run the modified query on the destination cluster's Hive
  5. Run SELECT * FROM my_table;. This returns 0 rows (expected)
  6. Run MSCK REPAIR TABLE my_table;. This passes as expected and registers
the partitions in the metastore.
  7. Run SELECT * FROM my_table LIMIT 10; - 10 rows are returned with
correct values
  8. On the destination cluster, from Spark that is configured to work with
the Hive Metastore, run the following code: spark.sql("SELECT * FROM
my_table limit 10").show() - This returns null values!
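
Step 3 above (swapping the LOCATION path in the SHOW CREATE TABLE output) can be scripted when there are many tables. The following is only a minimal sketch: the function name, the sample DDL, the HDFS path and the S3 bucket are all hypothetical, and it assumes the statement contains exactly one LOCATION clause whose path is a single-quoted string.

```python
import re

# Matches the LOCATION keyword, the whitespace after it (including a
# newline), and the single-quoted path that follows.
_LOCATION_RE = re.compile(r"(LOCATION\s+)'[^']*'", re.IGNORECASE)


def rewrite_location(ddl: str, new_location: str) -> str:
    """Return the DDL with the path of its LOCATION clause replaced."""
    new_ddl, count = _LOCATION_RE.subn(
        lambda m: f"{m.group(1)}'{new_location}'", ddl
    )
    if count != 1:
        # Refuse to guess if the clause is missing or appears more than once.
        raise ValueError(f"expected exactly one LOCATION clause, found {count}")
    return new_ddl


# Hypothetical DDL, shaped like SHOW CREATE TABLE output.
example_ddl = """CREATE EXTERNAL TABLE `my_table`(
  `id` bigint)
PARTITIONED BY (
  `year` int)
STORED AS PARQUET
LOCATION
  'hdfs://namenode:8020/user/hive/warehouse/my_table'
"""

if __name__ == "__main__":
    print(rewrite_location(example_ddl, "s3://example-bucket/warehouse/my_table"))
```

The count check is deliberate: a silent no-op would create the table on the destination cluster still pointing at the old HDFS path.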

The result returned from the Spark SQL query has all the correct columns
and the correct number of rows, but every value is null.

To get Spark to correctly load the values, I can add the following
properties to the TBLPROPERTIES part of the create query:

'spark.sql.partitionProvider'='catalog',
'spark.sql.sources.provider'='org.apache.spark.sql.parquet',
'spark.sql.sources.schema.numPartCols'='<partition-count>',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='<json-schema as seen by spark>',
'spark.sql.sources.schema.partCol.0'='<partition name 1>',
'spark.sql.sources.schema.partCol.1'='<partition name 2>',
...
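
For many tables, the property list above can be generated rather than typed by hand. This is a sketch of the workaround only, not an explanation of the null values. The helper names and the sample schema are hypothetical. The schema JSON is assumed to come from Spark itself (for example df.schema.json() in PySpark). The 4000-character chunk size is an assumption based on the default of spark.sql.sources.schemaStringLengthThreshold, and the backslash quoting assumes Hive-style string literals.

```python
import json


def spark_table_properties(schema_json, partition_cols, chunk_size=4000):
    """Build the spark.sql.* table properties listed in the message.

    chunk_size is an assumption: long schemas are split into several
    schema.part.N entries of at most this many characters.
    """
    parts = [
        schema_json[i:i + chunk_size]
        for i in range(0, len(schema_json), chunk_size)
    ]
    props = {
        "spark.sql.partitionProvider": "catalog",
        "spark.sql.sources.provider": "org.apache.spark.sql.parquet",
        "spark.sql.sources.schema.numPartCols": str(len(partition_cols)),
        "spark.sql.sources.schema.numParts": str(len(parts)),
    }
    for i, part in enumerate(parts):
        props[f"spark.sql.sources.schema.part.{i}"] = part
    for i, col in enumerate(partition_cols):
        props[f"spark.sql.sources.schema.partCol.{i}"] = col
    return props


def render_tblproperties(props):
    """Render the properties as a TBLPROPERTIES (...) clause."""
    def quote(value):
        # Assumes backslash escaping inside single-quoted literals.
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

    lines = [f"  {quote(k)}={quote(v)}" for k, v in props.items()]
    return "TBLPROPERTIES (\n" + ",\n".join(lines) + "\n)"


# Hypothetical schema with one data column and two partition columns.
example_schema = json.dumps(
    {
        "type": "struct",
        "fields": [
            {"name": n, "type": t, "nullable": True, "metadata": {}}
            for n, t in [("id", "long"), ("year", "integer"), ("month", "integer")]
        ],
    },
    separators=(",", ":"),
)

if __name__ == "__main__":
    props = spark_table_properties(example_schema, ["year", "month"])
    print(render_tblproperties(props))
```

Joining the entries with ",\n" puts a comma between every pair of properties and none after the last one, which avoids the kind of missing-separator slip that is easy to make when editing the list by hand.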



Conversely, on the source cluster Spark reads the table values without
any problem and without the extra TBLPROPERTIES.

Why is this happening? How can it be fixed?
-- 
Shmuel Blitz
*Big Data Developer*
www.similarweb.com