spark-user mailing list archives

From Adam Gilmore <dragoncu...@gmail.com>
Subject Re: Parquet schema changes
Date Mon, 05 Jan 2015 05:44:22 GMT
I saw that in the source, which is why I was wondering.

I was mainly reading:

http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/

"A query that tries to parse the organizationId and userId from the 2
logTypes should be able to do so correctly, though they are positioned
differently in the schema. With Parquet, it’s not a problem. It will merge
‘A’ and ‘V’ schemas and project columns accordingly. It does so by
maintaining a file schema in addition to merged schema and parsing the
columns by referencing the 2."
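
To illustrate what the quoted passage describes, here is a minimal pure-Python sketch (not Parquet's or Spark's actual code) of resolving columns by name against each file's own schema, so that position differences do not matter. The column lists and the `project` helper are hypothetical; only `organizationId`, `userId`, and the 'A'/'V' log types come from the article.

```python
def project(row, file_schema, wanted):
    """Pick the `wanted` columns from a positional row using that file's own schema.

    Columns the file does not have come back as None.
    """
    index = {name: i for i, name in enumerate(file_schema)}
    return tuple(row[index[c]] if c in index else None for c in wanted)


# Hypothetical layouts: the same two columns sit at different positions.
schema_a = ["logType", "organizationId", "userId"]
schema_v = ["logType", "userId", "pageName", "organizationId"]

rows = [
    (schema_a, ("A", "org1", "u1")),
    (schema_v, ("V", "u2", "home", "org2")),
]

# Each row is read through the schema of the file it came from.
projected = [project(row, schema, ["organizationId", "userId"])
             for schema, row in rows]
```

The point is only that the lookup goes through the per-file schema rather than a fixed position.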

I know that each part file can have its own schema, but I saw in the
implementation for Spark, if there was no metadata file, it'd just pick the
first file and use that schema across the board.  I'm not quite sure how
other implementations like Impala etc. deal with this, but I was really
hoping there'd be a way to "version" the schema as new records are added
and just project it through.
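
Roughly what I have in mind, as a pure-Python sketch rather than anything Spark or Parquet does today: union the schemas of all part files, then read every record through the merged schema, filling missing columns with nulls. The schemas, records, and function names below are made up for illustration.

```python
def merge_schemas(schemas):
    """Union column -> type mappings, keeping first-seen column order.

    Raises ValueError if two part files disagree on a column's type.
    """
    merged = {}
    for schema in schemas:
        for name, typ in schema.items():
            if merged.setdefault(name, typ) != typ:
                raise ValueError(f"conflicting types for column {name!r}")
    return merged


def read_all(parts, merged):
    """Yield every record projected onto the merged schema (missing -> None)."""
    for _schema, records in parts:
        for rec in records:
            yield {col: rec.get(col) for col in merged}


# Hypothetical part files: the second one was written with an extra column.
schema_v1 = {"id": "int", "name": "string"}
schema_v2 = {"id": "int", "name": "string", "email": "string"}

parts = [
    (schema_v1, [{"id": 1, "name": "a"}]),
    (schema_v2, [{"id": 2, "name": "b", "email": "b@example.com"}]),
]

merged = merge_schemas([schema for schema, _ in parts])
result = list(read_all(parts, merged))
```

Older records simply show a null for the new column, which is the behaviour I was hoping to get when querying.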

Would be a godsend for semi-structured data.

On Tue, Dec 23, 2014 at 3:33 PM, Cheng Lian <lian.cs.zju@gmail.com> wrote:

>  I must have missed something important here. Could you please provide more
> details on Parquet “schema versioning”? I wasn’t aware of this feature (which
> sounds really useful).
>
> Specifically, are you referring to the following scenario:
>
>    1. Write some data whose schema is A to “t.parquet”, resulting in a file
>    “t.parquet/parquet-r-1.part” on HDFS
>    2. Append more data whose schema B “contains” A, but has more columns,
>    to “t.parquet”, resulting in another file “t.parquet/parquet-r-2.part” on HDFS
>    3. Now read “t.parquet”, and schemas A and B are expected to be merged
>
> If this is the case, then Spark SQL currently doesn’t support this. We
> assume the schemas of all data within a single Parquet file (which is an HDFS
> directory with multiple part-files) are identical.
>
> On 12/22/14 1:11 PM, Adam Gilmore wrote:
>
>    Hi all,
>
>  I understand that parquet allows for schema versioning automatically in
> the format; however, I'm not sure whether Spark supports this.
>
>  I'm saving a SchemaRDD to a parquet file, registering it as a table,
> then doing an insertInto with a SchemaRDD with an extra column.
>
>  The second SchemaRDD does in fact get inserted, but the extra column
> isn't present when I try to query it with Spark SQL.
>
>  Is there anything I can do to get this working how I'm hoping?
>
