drill-user mailing list archives

From Jason Altekruse <altekruseja...@gmail.com>
Subject Re: Parquet drill date fields
Date Thu, 04 Feb 2016 23:49:51 GMT
We haven't turned on the 2.0 encodings in Drill's Parquet writer, so they
have not been thoroughly tested. That being said, we do use the standard
parquet-mr interfaces for reading Parquet files in our complex Parquet
reader. We currently depend on parquet-mr 1.8.1 in Drill, so it should be
compatible.

I think it would be safest to run with `store.parquet.use_new_reader` set
to true if you are going to work with Parquet 2.0 files right now.
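
For reference, that session option can be toggled from any Drill client; a minimal sketch of the statement (option name as referenced above, scoped to the current session only):

```sql
-- Enable Drill's alternative Parquet reader for this session only
ALTER SESSION SET `store.parquet.use_new_reader` = true;
```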

- Jason

On Thu, Feb 4, 2016 at 3:40 PM, Stefán Baxter <stefan@activitystream.com>
wrote:

> OK, the automatic handling and encoding options improve a lot in Parquet
> 2.0. (Manual override is not an option.)
>
> I'm using parquet-mr/parquet-avro to create parquet 2 files
> (ParquetProperties.WriterVersion.PARQUET_2_0).
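>
> A minimal sketch of such a writer setup, assuming parquet-avro 1.8.x on
> the classpath (the schema and output path here are hypothetical, not from
> this thread):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.hadoop.ParquetWriter;

public class Parquet2Writer {
    // Hypothetical helper: opens a writer that emits Parquet 2.0 pages
    // with dictionary encoding enabled, mirroring the setup described above.
    public static ParquetWriter<GenericRecord> open(Schema schema, String file)
            throws Exception {
        return AvroParquetWriter.<GenericRecord>builder(new Path(file))
                .withSchema(schema)
                .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
                .withDictionaryEncoding(true)
                .build();
    }
}
```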
>
> Drill seems to read them just fine, but I wonder if there are any gotchas.
>
> Regards,
>  -Stefán
>
>
> On Thu, Feb 4, 2016 at 4:51 PM, Stefán Baxter <stefan@activitystream.com>
> wrote:
>
> > Hi again,
> >
> > I did a little test: ~5 million fairly wide records take 791 MB in
> > Parquet without dictionary encoding and 550 MB with dictionary encoding
> > enabled (the non-dictionary-encoded file is a whopping 45% bigger).
> > The plain, non-dictionary-encoded file returns results for identical
> > queries in ~20% less time than the one that uses dictionary encoding.
> >
> > Regards,
> >  -Stefán
> >
> >
> >
> > On Thu, Feb 4, 2016 at 3:48 PM, Stefán Baxter <stefan@activitystream.com>
> > wrote:
> >
> >> Hi Jason,
> >>
> >> Thank you for the explanation.
> >>
> >> I have several *low* cardinality fields that contain semi-long values
> >> and they are, I think, a perfect candidate for dictionary encoding.
> >>
> >> I assumed that the choice to use dictionary encoding was a bit smarter
> >> than this and would rely on string-typed columns, where x% repeated
> >> values were a clear signal for its use.
> >>
> >> If you can outline what needs to be done and where, then I will gladly
> >> take a stab at it :).
> >>
> >> Several questions along those lines:
> >>
> >>    - Does the Parquet library that Drill uses allow for programmatic
> >>    selection of encodings?
> >>    - What metadata, regarding the column content, is available when the
> >>    choice is made?
> >>    - Where in the Parquet part of Drill is this logic?
> >>    - Is there no ongoing effort in parquet-mr to make the automatic
> >>    handling smarter?
> >>    - Are all Parquet encoding options being used by Drill? For example,
> >>    the delta encoding of longs, where the delta between subsequent
> >>    numbers is stored (as I understand it).
> >>
> >> Thanks again.
> >>
> >> Regards,
> >>  -Stefan
> >>
> >>
> >>
> >>
> >> On Thu, Feb 4, 2016 at 3:36 PM, Jason Altekruse <altekrusejason@gmail.com>
> >> wrote:
> >>
> >>> Hi Stefan,
> >>>
> >>> There is a reason that dictionary encoding is disabled by default. The
> >>> parquet-mr library we leverage for writing Parquet files currently
> >>> writes nearly all columns as dictionary encoded, for all types, when
> >>> dictionary encoding is enabled. This includes columns with integers,
> >>> doubles, dates and timestamps.
> >>>
> >>> Do you have some data that you believe is well suited for dictionary
> >>> encoding in the dataset? I think there are good uses for it, such as
> >>> data coming from systems that support enumerations, which might be
> >>> represented as strings when exported from a database for use with Big
> >>> Data tools like Drill. Unfortunately, we do not currently provide a
> >>> mechanism for requesting dictionary encoding on only some columns, and
> >>> we don't do anything like buffer values to determine whether a given
> >>> column is well suited for dictionary encoding before starting to write
> >>> them.
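
The buffering idea mentioned here, deciding per column whether dictionary
encoding is worthwhile, could be sketched as a simple cardinality heuristic.
This is plain Java for illustration only; the class, method name, and
threshold are made up and are not Drill or parquet-mr code:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DictionaryHeuristic {
    // Illustrative heuristic (hypothetical, not library code): a column is
    // a good dictionary-encoding candidate when the ratio of distinct
    // values to total values in a buffered sample is low, e.g. below 10%.
    public static boolean isDictionaryFriendly(List<String> sample,
                                               double maxDistinctRatio) {
        if (sample.isEmpty()) {
            return false;
        }
        // Count distinct values in the sample.
        Set<String> distinct = new HashSet<>(sample);
        double ratio = (double) distinct.size() / sample.size();
        return ratio <= maxDistinctRatio;
    }
}
```

Under this sketch, a low-cardinality enumeration-like column would qualify,
while a mostly-unique timestamp column like the one in the log entry below
(123,832 dictionary entries for 591,435 values) would be a marginal case.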
> >>>
> >>> In many cases it obviously is not a good choice, and so we actually
> >>> take a performance hit re-materializing the data out of the dictionary
> >>> upon read.
> >>>
> >>> If you would be interested in trying to contribute such an
> >>> enhancement, I would be willing to help you get started with it.
> >>>
> >>> - Jason
> >>>
> >>> On Wed, Feb 3, 2016 at 5:15 AM, Stefán Baxter <stefan@activitystream.com>
> >>> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> > I'm converting Avro to Parquet and I'm getting this log entry back
> >>> > for a timestamp field:
> >>> >
> >>> > Written 1,008,842B for [occurred_at] INT64: 591,435 values,
> >>> > 2,169,557B raw, 1,008,606B comp, 5 pages, encodings: [BIT_PACKED,
> >>> > PLAIN, PLAIN_DICTIONARY, RLE], dic { 123,832 entries, 990,656B raw,
> >>> > 123,832B comp}
> >>> >
> >>> > Can someone please tell me if this is the expected encoding for a
> >>> > timestamp field?
> >>> >
> >>> > I'm a bit surprised that it seems to be dictionary based. (Yes, I
> >>> > have enabled dictionary encoding for Parquet files.)
> >>> >
> >>> > Regards,
> >>> >  -Stefán
> >>> >
> >>>
> >>
> >>
> >
>
