drill-user mailing list archives

From Stefán Baxter <ste...@activitystream.com>
Subject Re: Parquet drill date fields
Date Thu, 04 Feb 2016 23:40:12 GMT
OK, the automatic handling and encoding options improve a lot in Parquet
2.0. (Manual override is not an option)

I'm using parquet-mr/parquet-avro to create Parquet 2 files
(ParquetProperties.WriterVersion.PARQUET_2_0).

Drill seems to read them just fine, but I wonder if there are any gotchas.

Regards,
 -Stefán


On Thu, Feb 4, 2016 at 4:51 PM, Stefán Baxter <stefan@activitystream.com>
wrote:

> Hi again,
>
> I did a little test: ~5 million fairly wide records take 791 MB in
> Parquet without dictionary encoding and 550 MB with dictionary encoding
> enabled (the non-dictionary-encoded file is a whopping ~44% bigger).
> The plain, non-dictionary-encoded file returns results for identical
> queries in ~20% less time than the one that uses dictionary encoding.
>
> Regards,
>  -Stefán
>
>
>
> On Thu, Feb 4, 2016 at 3:48 PM, Stefán Baxter <stefan@activitystream.com>
> wrote:
>
>> Hi Jason,
>>
>> Thank you for the explanation.
>>
>> I have several *low*-cardinality fields that contain semi-long values, and
>> I think they are perfect candidates for dictionary encoding.
>>
>> I assumed that the choice to use dictionary encoding was a bit smarter
>> than this and would rely on string-typed columns, where x% repeated values
>> would be a clear signal for its use.
>>
>> If you can outline what needs to be done and where, then I will gladly
>> take a stab at it :).
>>
>> Several questions along those lines:
>>
>>    - Does the Parquet library that Drill uses allow for programmatic
>>    selection of the encoding?
>>    - What metadata, regarding the column content, is available when the
>>    choice is made?
>>    - Where in the Parquet part of Drill is this logic?
>>    - Is there no ongoing effort in parquet-mr to make the automatic
>>    handling smarter?
>>    - Are all Parquet encoding options being used by Drill?
>>    - For example, the encoding of longs where the delta between
>>    successive numbers is stored (as I understand it).
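
To make the delta idea in the last bullet concrete, here is a minimal sketch that stores the first value and then only the differences between successive values. It is a simplification under my own assumptions and not Parquet's actual on-disk delta encoding; the function names and the sample timestamps are made up for illustration.

```python
def delta_encode(values):
    """Return (first_value, deltas) for a list of integers."""
    if not values:
        return None, []
    # Each delta is the difference between a value and its predecessor.
    deltas = [b - a for a, b in zip(values, values[1:])]
    return values[0], deltas


def delta_decode(first, deltas):
    """Rebuild the original list from the first value and the deltas."""
    if first is None:
        return []
    values = [first]
    for d in deltas:
        values.append(values[-1] + d)
    return values


# Hypothetical epoch-millisecond timestamps that are close together.
timestamps = [1454500000000, 1454500000250, 1454500001000, 1454500001010]
first, deltas = delta_encode(timestamps)
print(first, deltas)
```

The deltas are much smaller numbers than the timestamps themselves, which is why this kind of scheme can suit sorted or nearly sorted longs.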
>>
>> Thanks again.
>>
>> Regards,
>>  -Stefan
>>
>>
>>
>>
>> On Thu, Feb 4, 2016 at 3:36 PM, Jason Altekruse <altekrusejason@gmail.com>
>> wrote:
>>
>>> Hi Stefan,
>>>
>>> There is a reason that dictionary encoding is disabled by default. The
>>> parquet-mr library we leverage for writing Parquet files currently writes
>>> nearly all columns as dictionary encoded, for all types, when
>>> dictionary encoding is enabled. This includes columns with integers,
>>> doubles, dates and timestamps.
>>>
>>> Do you have some data that you believe is well suited for dictionary
>>> encoding in the dataset? I think there are good uses for it, such as data
>>> coming from systems that support enumerations, that might be represented
>>> as strings when exported from a database for use with Big Data tools like
>>> Drill. Unfortunately we do not currently provide a mechanism for
>>> requesting dictionary encoding on only some columns, and we don't do
>>> anything like buffer values to determine if a given column is well-suited
>>> for dictionary encoding before starting to write them.
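
A minimal sketch of the kind of buffering check described above, assuming a simple rule: look at a sample of values and opt in to dictionary encoding only when few of them are distinct. This is not what parquet-mr or Drill does; the function name, the 10% threshold and the sample columns are all hypothetical.

```python
def should_dictionary_encode(sample, max_distinct_ratio=0.1):
    """Guess from a buffered sample whether dictionary encoding is worthwhile.

    Returns True when the share of distinct values in the sample is at or
    below max_distinct_ratio (an arbitrary, illustrative threshold).
    """
    if not sample:
        return False
    distinct_ratio = len(set(sample)) / len(sample)
    return distinct_ratio <= max_distinct_ratio


# Hypothetical low-cardinality string column: 3 distinct values in 1000.
status = ["shipped", "pending", "shipped", "cancelled"] * 250
# Hypothetical timestamp column: every value is distinct.
occurred_at = list(range(1454500000000, 1454500001000))

print(should_dictionary_encode(status))       # low cardinality
print(should_dictionary_encode(occurred_at))  # high cardinality
```

A real writer would also have to bound how much it buffers and decide what to do when later data contradicts the sample.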
>>>
>>> In many cases it obviously is not a good choice, and so we actually take
>>> a performance hit re-materializing the data out of the dictionary upon
>>> read.
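
For readers unfamiliar with the term, the sketch below shows what dictionary encoding and re-materialization mean in principle: values are replaced by indices into a list of distinct values, and reading requires a lookup per value. It is a conceptual illustration only, not the Parquet page layout; the function names and sample data are hypothetical.

```python
def dictionary_encode(values):
    """Return (dictionary, indices) with distinct values in first-seen order."""
    dictionary = []   # distinct values
    positions = {}    # value -> index into dictionary
    indices = []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return dictionary, indices


def materialize(dictionary, indices):
    """Rebuild the original values: one dictionary lookup per stored index."""
    return [dictionary[i] for i in indices]


# Low cardinality: the dictionary stays tiny compared with the column.
countries = ["DE", "US", "DE", "DE", "US"]
country_dict, country_idx = dictionary_encode(countries)

# All values distinct: the dictionary is as large as the column, and the
# indices are pure overhead on top of it.
ids = [10, 20, 30]
id_dict, id_idx = dictionary_encode(ids)

print(country_dict, country_idx)
print(id_dict, id_idx)
```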
>>>
>>> If you would be interested in trying to contribute such an enhancement I
>>> would be willing to help you get started with it.
>>>
>>> - Jason
>>>
>>> On Wed, Feb 3, 2016 at 5:15 AM, Stefán Baxter <stefan@activitystream.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > I'm converting Avro to Parquet and I'm getting this log entry back for a
>>> > timestamp field:
>>> >
>>> > Written 1,008,842B for [occurred_at] INT64: 591,435 values, 2,169,557B raw,
>>> > 1,008,606B comp, 5 pages, encodings: [BIT_PACKED, PLAIN, PLAIN_DICTIONARY,
>>> > RLE], dic { 123,832 entries, 990,656B raw, 123,832B comp}
>>> >
>>> > Can someone please tell me if this is the expected encoding for a
>>> > timestamp field?
>>> >
>>> > I'm a bit surprised that it seems to be dictionary based. (Yes, I have
>>> > enabled dictionary encoding for Parquet files).
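
The figures in the log entry above can be sanity-checked with a little arithmetic. The numbers are copied from the log line; the variable names are mine, and the reading of them is a rough interpretation rather than anything parquet-mr reports.

```python
# Figures copied from the log entry for [occurred_at].
values = 591_435           # values written
dict_entries = 123_832     # dictionary entries
dict_raw_bytes = 990_656   # raw dictionary size in bytes

distinct_fraction = dict_entries / values
bytes_per_entry = dict_raw_bytes / dict_entries

print(f"{distinct_fraction:.1%} distinct, "
      f"{bytes_per_entry:.0f} bytes per dictionary entry")
```

So roughly one value in five is distinct, and each dictionary entry takes 8 bytes, which is the width of an INT64.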
>>> >
>>> > Regards,
>>> >  -Stefán
>>> >
>>>
>>
>>
>
