spark-user mailing list archives

From Jerrick Hoang <jerrickho...@gmail.com>
Subject Refresh table
Date Tue, 11 Aug 2015 06:14:54 GMT
Hi all,

I'm a little confused about how refresh table (SPARK-5833) should work, so
I did the following:

val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")

df1.write.parquet("hdfs://<path>/test_table/key=1")


Then I created an external table by running:

CREATE EXTERNAL TABLE `tmp_table` (
  `single` int,
  `double` int)
PARTITIONED BY (
  `key` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://<path>/test_table/'

Then I added the partition to the table by `alter table tmp_table add
partition (key=1) location 'hdfs://..`
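
For reference, the full statement looked roughly like this (the location is
just a placeholder for the same path the parquet was written to above):

ALTER TABLE tmp_table ADD PARTITION (key=1)
LOCATION 'hdfs://<path>/test_table/key=1';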

Then I added a new partition with a different schema:

val df2 = sc.makeRDD(1 to 5).map(i => (i, i * 3)).toDF("single", "triple")

df2.write.parquet("hdfs://<path>/test_table/key=2")


Then I added this new partition to the table with another `alter table ..` statement.
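
Roughly, again with a placeholder location:

ALTER TABLE tmp_table ADD PARTITION (key=2)
LOCATION 'hdfs://<path>/test_table/key=2';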

But when I then ran `refresh table tmp_table` followed by `describe table`,
the new column `triple` did not show up. Can someone explain how partition
discovery and schema merging are supposed to work with refresh table?
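
In case it helps, this is a minimal sketch of the sequence I mean, assuming a
HiveContext bound to sqlContext and the same placeholder names and paths as
above; the last two lines just read the table directory directly with the
Parquet mergeSchema option, for comparison:

sqlContext.sql("REFRESH TABLE tmp_table")
sqlContext.sql("DESCRIBE tmp_table").show()

// For comparison: read the table directory directly with schema merging enabled
val merged = sqlContext.read.option("mergeSchema", "true").parquet("hdfs://<path>/test_table")
merged.printSchema()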

Thanks
