drill-dev mailing list archives

From karthik tunga <karthik.tu...@gmail.com>
Subject Re: Drill native format
Date Fri, 14 Sep 2012 22:08:37 GMT
Hi,

This paper (http://arxiv.org/pdf/1105.4252.pdf) compares a column-oriented
layout (one file per column) with RCFile.
The authors use skip lists and lazy record construction.

Cheers,
Karthik

On 14 September 2012 17:15, Amir Youssefi <amir.youssefi@gmail.com> wrote:

> "Nested data is not yet implemented" in BigQuery (if I recall the exact
> words correctly). I am quoting the speaker at the BigQuery presentation at
> the Google Technology User Group last week at the Googleplex (intentionally
> not citing the speaker's name).
>
> -ay
>
> On Sep 14, 2012, at 1:28 PM, David Gruzman <david@bigdatacraft.com> wrote:
>
> > I assume that evolution of BigQuery reflects resolution of Dremel... If
> > somebody has information on it, it would be great.
> > The storage system should understand that all files comprising a
> > horizontal partition of the table are one logical entity, and should store
> > them together / in some proximity. I agree that PAX will be much more
> > convenient. The question is: is there a performance penalty for PAX vs.
> > file per column?
> > David
> >
> > On Fri, Sep 14, 2012 at 11:21 PM, Tomer Shiran <tshiran@maprtech.com> wrote:
> >
> >> Is there any public information suggesting that Google moved away from
> >> supporting nested data? Clearly BigQuery doesn't yet allow nested data,
> >> but not sure that applies to Dremel.
> >>
> >> There are challenges with one file per column. How do you ensure that a
> >> single record is located on a single machine to avoid costly record
> >> reconstruction?
> >>
> >> On Fri, Sep 14, 2012 at 1:05 PM, David Gruzman <david@bigdatacraft.com> wrote:
> >>
> >>> Hi All,
> >>> I would like to discuss what the native format for Drill should be.
> >>> The original Google Dremel paper defined a hierarchical columnar data
> >>> format. Since then Google has shifted away from the hierarchical data
> >>> format... So the question is whether it makes sense to stick with it.
> >>> If we are also moving to a simple flat format, we need our own format
> >>> that we support as "native". In the case of Drill, I would define
> >>> native support as "high performance".
> >>> I think we can go with some kind of PAX format with comprehensive
> >>> metadata in the header, so each file is completely self-contained and
> >>> can be understood and processed without any external data.
> >>> The alternative is to have a single file per column. As far as I
> >>> remember from our OpenDremel work, the main decision point is whether
> >>> we can read one column from the file without loading unnecessary data
> >>> from other columns into node memory.
> >>> With best regards,
> >>> David
> >>>
> >>
> >>
> >>
> >> --
> >> Tomer Shiran
> >> Director of Product Management | MapR Technologies | 650-804-8657
> >>
>
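
To make the PAX idea from the thread concrete, here is a minimal sketch of a self-contained block whose header records where each column chunk starts, so one column can be read by seeking past the others. Everything in it is an assumption made for illustration, not a format proposed for Drill or described in the Dremel paper: the names write_pax_block and read_column, the JSON header, and the fixed-width 64-bit integer encoding are all hypothetical.

```python
import io
import json
import struct


def write_pax_block(columns):
    """Serialize {column name: list of ints} into one self-describing block.

    Hypothetical layout: 4-byte header length, JSON header with per-column
    offsets and counts, then the column chunks stored back to back.
    """
    body = io.BytesIO()
    meta = {}
    for name, values in columns.items():
        start = body.tell()
        # Each value is stored as a little-endian signed 64-bit integer.
        body.write(struct.pack(f"<{len(values)}q", *values))
        meta[name] = {"offset": start, "count": len(values)}
    header = json.dumps(meta).encode("utf-8")
    return struct.pack("<I", len(header)) + header + body.getvalue()


def read_column(stream, name):
    """Read one column by seeking to its chunk; other chunks are not read."""
    stream.seek(0)
    (header_len,) = struct.unpack("<I", stream.read(4))
    meta = json.loads(stream.read(header_len))
    entry = meta[name]
    # Column offsets are relative to the end of the header.
    stream.seek(4 + header_len + entry["offset"])
    raw = stream.read(8 * entry["count"])
    return list(struct.unpack(f"<{entry['count']}q", raw))


block = write_pax_block({"user_id": [1, 2, 3], "clicks": [10, 20, 30]})
clicks = read_column(io.BytesIO(block), "clicks")
```

Because all columns of a row group live in one block, a record's fields stay on the same machine, which is the concern raised above about one file per column. Whether the extra seek costs anything relative to file-per-column is the open question in the thread, and this sketch does not answer it.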
