drill-dev mailing list archives

From Amir Youssefi <amir.youss...@gmail.com>
Subject Re: Drill native format
Date Fri, 14 Sep 2012 21:15:28 GMT
"Nested data is not yet implemented" in BigQuery (if I recall the exact words correctly). I am
quoting the speaker at the BigQuery presentation at the Google Technology User Group meeting
last week at the Googleplex (intentionally not citing the speaker's name).


On Sep 14, 2012, at 1:28 PM, David Gruzman <david@bigdatacraft.com> wrote:

> I assume that the evolution of BigQuery reflects the resolution of Dremel...
> If somebody has information on it, it would be great.
> The storage system should understand that all the files comprising a
> horizontal partition of the table are one logical entity, and should store
> them together or in some proximity. I agree that PAX will be much more
> convenient. The question is: is there a performance penalty for PAX vs. a
> file per column?
> David
> On Fri, Sep 14, 2012 at 11:21 PM, Tomer Shiran <tshiran@maprtech.com> wrote:
>> Is there any public information suggesting that Google moved away from
>> supporting nested data? Clearly BigQuery doesn't yet allow nested data, but
>> I'm not sure that applies to Dremel.
>> There are challenges with one file per column. How do you ensure that a
>> single record is located on a single machine to avoid costly record
>> reconstruction?
>> On Fri, Sep 14, 2012 at 1:05 PM, David Gruzman <david@bigdatacraft.com> wrote:
>>> Hi All,
>>> I would like to discuss what the native format for Drill will be. The
>>> original Google Dremel paper defined its hierarchical columnar data
>>> format. Since then Google has shifted away from the hierarchical data
>>> format... So the question is whether it makes sense to stick with it.
>>> If we are also moving to a simple flat format, we need our own format
>>> that we have to support as "native". In the case of Drill I would define
>>> native support as "high performance".
>>> I think we can go with some kind of PAX format with comprehensive
>>> metadata in the header, so each file is completely self-contained and
>>> can be understood and processed without any external data.
>>> The alternative is to have a single file per column. As far as I
>>> remember from our OpenDremel work, the main decision point is whether
>>> we can read one column from the file without loading unnecessary data
>>> from other columns into node memory.
>>> With best regards,
>>> David
>> --
>> Tomer Shiran
>> Director of Product Management | MapR Technologies | 650-804-8657
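
The PAX idea discussed in the quoted messages (a self-contained file whose
header records where each column lives, so that one column can be read
without loading the others) can be sketched roughly as follows. This is only
an illustration in Python, not Drill's or Dremel's actual format: the layout
(a 4-byte length prefix, a JSON header, JSON-encoded column chunks) and the
names write_pax and read_column are assumptions made up for this sketch.

```python
import io
import json
import struct


def write_pax(stream, columns):
    """Write one row group as [4-byte header length][JSON header][column chunks].

    `columns` maps a column name to a list of values. The layout is a
    hypothetical one chosen for illustration only.
    """
    # Encode each column as its own contiguous chunk of bytes.
    chunks = {name: json.dumps(values).encode("utf-8")
              for name, values in columns.items()}

    # The header records each chunk's offset relative to the start of the
    # data area, so the file can be read without any external metadata.
    header = {"columns": []}
    offset = 0
    for name, chunk in chunks.items():
        header["columns"].append(
            {"name": name, "offset": offset, "length": len(chunk)})
        offset += len(chunk)

    header_bytes = json.dumps(header).encode("utf-8")
    stream.write(struct.pack("<I", len(header_bytes)))
    stream.write(header_bytes)
    for chunk in chunks.values():
        stream.write(chunk)


def read_column(stream, name):
    """Read a single column by seeking to its chunk; other chunks are skipped."""
    stream.seek(0)
    (header_len,) = struct.unpack("<I", stream.read(4))
    header = json.loads(stream.read(header_len))
    data_start = 4 + header_len
    for col in header["columns"]:
        if col["name"] == name:
            stream.seek(data_start + col["offset"])
            return json.loads(stream.read(col["length"]))
    raise KeyError(name)


if __name__ == "__main__":
    buf = io.BytesIO()  # stands in for a file on disk
    write_pax(buf, {"user": ["a", "b"], "clicks": [3, 7]})
    print(read_column(buf, "clicks"))
```

A real format would presumably hold many row groups per file and use a
compact binary encoding per column. The point here is only that the header
offsets let a reader seek directly to one column, which is the decision
point raised above for PAX versus one file per column.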
