drill-user mailing list archives

From Kunal Khatua <kkha...@mapr.com>
Subject Re: Parquet filter pushdown and string fields that use dictionary encoding
Date Thu, 01 Jun 2017 00:47:53 GMT

I might not be completely accurate, but the min-max technique lets you determine whether a value
matching a String-based filter could exist in a rowgroup (currently, Drill doesn't check at the
page level). The comparison can be incorrect when the bytes of the text are not compared as
unsigned bytes. Drill applies the Parquet filter pushdown at planning time.
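
To make the signed/unsigned point concrete, here is a minimal Python sketch. It is only an illustration, not Drill or Parquet code: the helper names (signed_key, may_contain) and the sample strings are made up. It assumes the min/max statistics were written using a signed byte ordering while the reader compares bytes as unsigned.

```python
def signed_key(data: bytes) -> list:
    """Order bytes the way a signed 8-bit comparison (e.g. Java's byte) would."""
    return [b - 256 if b > 127 else b for b in data]


def may_contain(value: bytes, col_min: bytes, col_max: bytes) -> bool:
    """Row-group check using unsigned lexicographic order (Python's default for bytes)."""
    return col_min <= value <= col_max


# Hypothetical row group holding three strings, one starting with a non-ASCII letter.
values = [s.encode("utf-8") for s in ("apple", "zebra", "\u00e9mile")]

# Statistics computed with unsigned ordering: 0xC3 (first byte of the accented e) sorts last.
unsigned_stats = (min(values), max(values))

# Statistics computed with signed ordering: 0xC3 becomes -61 and sorts first.
signed_stats = (min(values, key=signed_key), max(values, key=signed_key))

wanted = "apple".encode("utf-8")
print(may_contain(wanted, *unsigned_stats))  # True: the row group is kept
print(may_contain(wanted, *signed_stats))    # False: the row group is wrongly skipped
```

With mismatched orderings, the row group that really does contain "apple" looks prunable, which is why the statistics cannot be trusted for such data.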


However, for dictionary-encoded fields, the Reader/Scanner would need to decode the Dictionary
page to identify whether a filter condition's value is present in the subsequent data pages.
This would (most likely) be done during execution time, and I don't believe Drill does that
as yet.
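
Roughly, the dictionary check described above could look like the sketch below. This is a hypothetical model in Python, not what Drill's reader does today: ColumnChunk, can_skip and the sample IDs are invented for illustration, and it assumes every data page in the chunk is dictionary-encoded (a writer may fall back to plain encoding for some pages, in which case the dictionary proves nothing).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnChunk:
    """Hypothetical stand-in for one dictionary-encoded column chunk in a row group."""
    dictionary: List[str]   # decoded dictionary page: the distinct values
    indexes: List[int]      # data pages: positions into the dictionary
    all_pages_dictionary_encoded: bool = True


def can_skip(chunk: ColumnChunk, wanted: str) -> bool:
    """Return True only when the dictionary proves the value is absent."""
    if not chunk.all_pages_dictionary_encoded:
        # Some pages may hold values not listed in the dictionary, so nothing is proven.
        return False
    return wanted not in chunk.dictionary


chunk = ColumnChunk(dictionary=["s-17a", "s-17b", "s-20c"], indexes=[0, 0, 1, 2, 1])
print(can_skip(chunk, "s-99z"))  # True: value is not in the dictionary
print(can_skip(chunk, "s-17b"))  # False: value may be present, so the chunk must be read
```

Note that the check only reads the dictionary page, not the data pages, which is where the saving would come from.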




________________________________
From: Stefán Baxter <stefan@activitystream.com>
Sent: Wednesday, May 31, 2017 5:08:23 PM
To: user
Subject: Re: Parquet filter pushdown and string fields that use dictionary encoding

Thank you Kunal.

Can you please explain to me why min/max values would be relevant for
dictionary encoded fields? (I think I may be completely misunderstanding
how they work)

Regards,
 -Stefán

On Wed, May 31, 2017 at 5:55 PM, Kunal Khatua <kkhatua@mapr.com> wrote:

> Even though filter pushdown is supported in Drill, it is limited to
> pushing down of numeric values including dates. We do not support pushdown
> of varchar because of this bug in the parquet library:
>
> https://issues.apache.org/jira/browse/PARQUET-686
>
> This comparison-correctness issue is what makes relying on the Parquet
> library's min-max statistics unreliable.
>
>
> ________________________________
> From: Stefán Baxter <stefan@activitystream.com>
> Sent: Monday, May 29, 2017 1:41:30 PM
> To: user
> Subject: Parquet filter pushdown and string fields that use dictionary
> encoding
>
> Hi,
>
> I would like to verify that my understanding of parquet filter pushdown in
> Drill (https://drill.apache.org/docs/parquet-filter-pushdown/) is correct.
>
> Is it correctly understood that Drill does not support predicate push-down
> for string fields when dictionary based string encoding is enabled?  (It
> looks like Presto can do this.)
>
> We save a lot of space using dictionary encoding (not enabled in Drill 1.10
> by default) and if my understanding of how-it-works is correct then the
> segment dictionary could be used to determine if a value is in a segment
> or if it can be pruned/skipped when filtering based on columns that are
> compressed/encoded using a dictionary.
>
> I may be misunderstanding how this works and perhaps the dictionary is
> created for the file as a whole and not individual sections but I know that
> min/max values would not be good to determine the need for a segment scan.
>
> I was hoping we could use partitioning on field(s) with lower cardinality
> to create partitions for typical partition pruning and then sort the
> contents of individual fields by session/customer IDs (which include
> alphanumeric characters here) so that segments would only contain a
> relatively low number of those unique values to facilitate "segment
> pruning" when looking for data belonging to individual sessions/customers.
>
> Best regards,
>  -Stefán Baxter
>
