drill-user mailing list archives

From Stefán Baxter <ste...@activitystream.com>
Subject Re: Parquet drill date fields
Date Thu, 04 Feb 2016 23:51:28 GMT
thnx, will do

On Thu, Feb 4, 2016 at 11:49 PM, Jason Altekruse <altekrusejason@gmail.com>
wrote:

> We haven't turned on the 2.0 encodings in Drill's Parquet writer, so they
> have not been thoroughly tested. That being said we do use the standard
> parquet-mr interfaces for reading parquet files in our complex parquet
> reader. We are currently depending on 1.8.1 in Drill, so it should be
> compatible.
>
> I think it would be safest to run with `store.parquet.use_new_reader` set
> to true if you are going to be working with Parquet 2.0 files right now.
>
> - Jason
>
> On Thu, Feb 4, 2016 at 3:40 PM, Stefán Baxter <stefan@activitystream.com>
> wrote:
>
> > OK, the automatic handling and encoding options improve a lot in Parquet
> > 2.0. (Manual override is not an option)
> >
> > I'm using parquet-mr/parquet-avro to create parquet 2 files
> > (ParquetProperties.WriterVersion.PARQUET_2_0).
> >
> > Drill seems to read them just fine, but I wonder if there are any gotchas.
> >
> > Regards,
> >  -Stefán
> >
> >
> > On Thu, Feb 4, 2016 at 4:51 PM, Stefán Baxter <stefan@activitystream.com>
> > wrote:
> >
> > > Hi again,
> > >
> > > I did a little test and ~5 million fairly wide records take 791 MB in
> > > Parquet without dictionary encoding and 550 MB with dictionary encoding
> > > enabled (the non-dictionary-encoded file is a whopping ~44% bigger).
> > > The plain, non-dictionary-encoded file returns results for identical
> > > queries in ~20% less time than the one that uses dictionary encoding.
> > >
> > > Regards,
> > >  -Stefán
> > >
> > >
> > >
> > > On Thu, Feb 4, 2016 at 3:48 PM, Stefán Baxter
> > > <stefan@activitystream.com> wrote:
> > >
> > >> Hi Jason,
> > >>
> > >> Thank you for the explanation.
> > >>
> > >> I have several *low* cardinality fields that contain semi-long values
> > >> and they are, I think, perfect candidates for dictionary encoding.
> > >>
> > >> I assumed that the choice to use dictionary encoding was a bit smarter
> > >> than this and would rely on string-type columns, where x% repeated
> > >> values would be a clear signal for its use.
> > >>
> > >> If you can outline what needs to be done and where, then I will gladly
> > >> take a stab at it :).
> > >>
> > >> Several questions along those lines:
> > >>
> > >>    - Does the Parquet library that Drill uses allow for programmatic
> > >>    selection?
> > >>    - What metadata, regarding the column content, is available when
> > >>    the choice is made?
> > >>    - Where in the Parquet part of Drill is this logic?
> > >>    - Is there no ongoing effort in parquet-mr to make the automatic
> > >>    handling smarter?
> > >>    - Are all Parquet encoding options being used by Drill?
> > >>    - Like the encoding of longs where the delta between semi-subsequent
> > >>    numbers is stored. (As I understand it)
> > >>
> > >> thanks again.
> > >>
> > >> Regards,
> > >>  -Stefan
> > >>
> > >>
> > >>
> > >>
> > >> On Thu, Feb 4, 2016 at 3:36 PM, Jason Altekruse
> > >> <altekrusejason@gmail.com> wrote:
> > >>
> > >>> Hi Stefan,
> > >>>
> > >>> There is a reason that dictionary encoding is disabled by default. The
> > >>> parquet-mr library we leverage for writing Parquet files currently
> > >>> writes nearly all columns as dictionary encoded, for all types, when
> > >>> dictionary encoding is enabled. This includes columns with integers,
> > >>> doubles, dates and timestamps.
> > >>>
> > >>> Do you have some data that you believe is well suited for dictionary
> > >>> encoding in the dataset? I think there are good uses for it, such as
> > >>> data coming from systems that support enumerations, that might be
> > >>> represented as strings when exported from a database for use with Big
> > >>> Data tools like Drill. Unfortunately we do not currently provide a
> > >>> mechanism for requesting dictionary encoding on only some columns, and
> > >>> we don't do anything like buffer values to determine if a given column
> > >>> is well-suited for dictionary encoding before starting to write them.
> > >>>
> > >>> In many cases it obviously is not a good choice, and so we actually
> > >>> take a performance hit re-materializing the data out of the dictionary
> > >>> upon read.
> > >>>
> > >>> If you would be interested in trying to contribute such an enhancement
> > >>> I would be willing to help you get started with it.
> > >>>
> > >>> - Jason
> > >>>
> > >>> On Wed, Feb 3, 2016 at 5:15 AM, Stefán Baxter
> > >>> <stefan@activitystream.com> wrote:
> > >>>
> > >>> > Hi,
> > >>> >
> > >>> > I'm converting Avro to Parquet and I'm getting this log entry back
> > >>> > for a timestamp field:
> > >>> >
> > >>> > Written 1,008,842B for [occurred_at] INT64: 591,435 values,
> > >>> > 2,169,557B raw, 1,008,606B comp, 5 pages, encodings: [BIT_PACKED,
> > >>> > PLAIN, PLAIN_DICTIONARY, RLE], dic { 123,832 entries, 990,656B raw,
> > >>> > 123,832B comp}
> > >>> >
> > >>> > Can someone please tell me if this is the expected encoding for a
> > >>> > timestamp field.
> > >>> >
> > >>> > I'm a bit surprised that it seems to be dictionary based. (Yes, I
> > >>> > have enabled dictionary encoding for Parquet files).
> > >>> >
> > >>> > Regards,
> > >>> >  -Stefán
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >
>
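
Jason's message notes that the writer does not buffer values to decide whether a column is well suited for dictionary encoding. A minimal sketch of what such a check could look like is given here in Python. It is not Drill or parquet-mr code: the function names and the 10% threshold are hypothetical and chosen only for illustration.

```python
from typing import Hashable, Iterable


def distinct_ratio(values: Iterable[Hashable]) -> float:
    """Fraction of values in a buffered sample that are distinct.

    Returns 1.0 for an empty sample, so empty input never looks
    like a good dictionary candidate.
    """
    sample = list(values)
    if not sample:
        return 1.0
    return len(set(sample)) / len(sample)


def suits_dictionary_encoding(values: Iterable[Hashable],
                              max_distinct_ratio: float = 0.1) -> bool:
    """Hypothetical heuristic: accept a column when few values are distinct.

    The 0.1 default is an assumption for this sketch, not a value
    taken from Drill or parquet-mr.
    """
    return distinct_ratio(values) <= max_distinct_ratio


# A low-cardinality string column, like an exported enumeration.
statuses = ["created", "updated", "deleted"] * 1000

# A timestamp-like column where every value is different.
timestamps = list(range(1_454_000_000_000, 1_454_000_003_000))

print(suits_dictionary_encoding(statuses))    # True
print(suits_dictionary_encoding(timestamps))  # False
```

A real writer would also have to bound the size of the buffered sample and consider value width, which this sketch ignores.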

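
The figures quoted in the thread can be checked with a few lines of arithmetic. The sketch below, in Python, uses only numbers that appear in the messages above (the two file sizes and the `occurred_at` log entry). The variable names are made up for this example.

```python
# File sizes from the test with ~5 million records (MB).
plain_mb = 791
dict_mb = 550

# How much bigger the plain file is, relative to the dictionary-encoded one.
extra_fraction = (plain_mb - dict_mb) / dict_mb
# How much space dictionary encoding saves, relative to the plain file.
saved_fraction = (plain_mb - dict_mb) / plain_mb

# Figures from the log entry for the occurred_at INT64 column.
value_count = 591_435
dict_entries = 123_832
dict_raw_bytes = 990_656

# Share of values that are distinct (one dictionary entry per distinct value).
distinct_fraction = dict_entries / value_count
# Raw bytes per dictionary entry; an INT64 takes 8 bytes.
bytes_per_entry = dict_raw_bytes / dict_entries

print(f"plain file is {extra_fraction:.1%} bigger")
print(f"dictionary encoding saves {saved_fraction:.1%}")
print(f"{distinct_fraction:.1%} of occurred_at values are distinct")
print(f"{bytes_per_entry:.0f} raw bytes per dictionary entry")
```

Roughly one in five timestamp values being distinct is consistent with the point made in the thread that a timestamp column is a poor fit for dictionary encoding.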