spark-user mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: Parquet schema changes
Date Tue, 23 Dec 2014 05:33:36 GMT
I must have missed something important here; could you please provide 
more details on Parquet “schema versioning”? I wasn’t aware of this 
feature (which sounds really useful).

In particular, are you referring to the following scenario:

 1. Write some data whose schema is A to “t.parquet”, resulting in a
    file “t.parquet/parquet-r-1.part” on HDFS
 2. Append more data to “t.parquet” whose schema B “contains” A but has
    more columns, resulting in another file “t.parquet/parquet-r-2.part”
    on HDFS
 3. Now read “t.parquet”, expecting schemas A and B to be merged

If this is the case, then current Spark SQL doesn’t support it. We 
assume that the schemas of all data within a single Parquet file (which 
is actually an HDFS directory containing multiple part-files) are 
identical.
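
For concreteness, here is a minimal sketch of that scenario (it also 
matches the saveAsParquetFile/registerTempTable/insertInto steps you 
describe below). It assumes a Spark 1.2-era spark-shell session, so 
“sc” is already in scope; the case classes A and B and the path 
“t.parquet” are purely illustrative:

    import org.apache.spark.sql.SQLContext

    // Hypothetical schemas: B "contains" A plus one extra column
    case class A(id: Int)
    case class B(id: Int, name: String)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD

    // 1. Write data with schema A; this creates the HDFS directory
    //    "t.parquet" containing a single part-file
    sc.parallelize(Seq(A(1), A(2))).saveAsParquetFile("t.parquet")

    // 2. Append data with schema B via insertInto, producing a second
    //    part-file in the same directory
    sqlContext.parquetFile("t.parquet").registerTempTable("t")
    sc.parallelize(Seq(B(3, "c"))).insertInto("t")

    // 3. Read "t.parquet" back: the extra "name" column from schema B
    //    is not visible, because Spark SQL assumes all part-files share
    //    a single schema rather than merging A and B
    sqlContext.parquetFile("t.parquet").printSchema()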

On 12/22/14 1:11 PM, Adam Gilmore wrote:

> Hi all,
>
> I understand that Parquet allows for schema versioning automatically 
> in the format; however, I'm not sure whether Spark supports this.
>
> I'm saving a SchemaRDD to a parquet file, registering it as a table, 
> then doing an insertInto with a SchemaRDD with an extra column.
>
> The second SchemaRDD does in fact get inserted, but the extra column 
> isn't present when I try to query it with Spark SQL.
>
> Is there anything I can do to get this working the way I'm hoping?

