spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adam Gilmore <dragoncu...@gmail.com>
Subject Re: Parquet schema changes
Date Wed, 07 Jan 2015 09:00:09 GMT
Fantastic - glad to see that it's in the pipeline!

On Wed, Jan 7, 2015 at 11:27 AM, Michael Armbrust <michael@databricks.com>
wrote:

> I want to support this but we don't yet.  Here is the JIRA:
> https://issues.apache.org/jira/browse/SPARK-3851
>
> On Tue, Jan 6, 2015 at 5:23 PM, Adam Gilmore <dragoncurve@gmail.com>
> wrote:
>
>> Anyone got any further thoughts on this?  I saw the _metadata file seems
>> to store the schema of every single part (i.e. file) in the parquet
>> directory, so in theory it should be possible.
>>
>> Effectively, our use case is that we have a stack of JSON that we receive
>> and we want to encode to Parquet for high performance, but there is
>> potential of new fields being added to the JSON structure, so we want to be
>> able to handle that every time we encode to Parquet (we'll be doing it
>> "incrementally" for performance).
>>
>> On Mon, Jan 5, 2015 at 3:44 PM, Adam Gilmore <dragoncurve@gmail.com>
>> wrote:
>>
>>> I saw that in the source, which is why I was wondering.
>>>
>>> I was mainly reading:
>>>
>>> http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/
>>>
>>> "A query that tries to parse the organizationId and userId from the 2
>>> logTypes should be able to do so correctly, though they are positioned
>>> differently in the schema. With Parquet, it’s not a problem. It will merge
>>> ‘A’ and ‘V’ schemas and project columns accordingly. It does so by
>>> maintaining a file schema in addition to merged schema and parsing the
>>> columns by referencing the 2."
>>>
>>> I know that each part file can have its own schema, but I saw in the
>>> implementation for Spark, if there was no metadata file, it'd just pick the
>>> first file and use that schema across the board.  I'm not quite sure how
>>> other implementations like Impala etc. deal with this, but I was really
>>> hoping there'd be a way to "version" the schema as new records are added
>>> and just project it through.
>>>
>>> Would be a godsend for semi-structured data.
>>>
>>> On Tue, Dec 23, 2014 at 3:33 PM, Cheng Lian <lian.cs.zju@gmail.com>
>>> wrote:
>>>
>>>>  I must missed something important here, could you please provide more
>>>> clue on Parquet “schema versioning”? I wasn’t aware of this feature
(which
>>>> sounds really useful).
>>>>
>>>> Especially, are you referring the following scenario:
>>>>
>>>>    1. Write some data whose schema is A to “t.parquet”, resulting a
>>>>    file “t.parquet/parquet-r-1.part” on HDFS
>>>>    2. Append more data whose schema B “contains” A, but has more
>>>>    columns to “t.parquet”, resulting another file “t.parquet/parquet-r-2.part”
>>>>    on HDFS
>>>>    3. Now read “t.parquet”, and schema A and B are expected to be
>>>>    merged
>>>>
>>>> If this is the case, then current Spark SQL doesn’t support this. We
>>>> assume schemas of all data within a single Parquet file (which is an HDFS
>>>> directory with multiple part-files) are identical.
>>>>
>>>> On 12/22/14 1:11 PM, Adam Gilmore wrote:
>>>>
>>>>    Hi all,
>>>>
>>>>  I understand that parquet allows for schema versioning automatically
>>>> in the format; however, I'm not sure whether Spark supports this.
>>>>
>>>>  I'm saving a SchemaRDD to a parquet file, registering it as a table,
>>>> then doing an insertInto with a SchemaRDD with an extra column.
>>>>
>>>>  The second SchemaRDD does in fact get inserted, but the extra column
>>>> isn't present when I try to query it with Spark SQL.
>>>>
>>>>  Is there anything I can do to get this working how I'm hoping?
>>>>
>>>>   ​
>>>>
>>>
>>>
>>
>

Mime
View raw message