drill-user mailing list archives

From Stefán Baxter <ste...@activitystream.com>
Subject Re: Parquet filter pushdown and string fields that use dictionary encoding
Date Thu, 01 Jun 2017 00:08:23 GMT
Thank you Kunal.

Can you please explain to me why min/max values would be relevant for
dictionary-encoded fields? (I think I may be completely misunderstanding
how they work.)

Regards,
 -Stefán
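
The comparison problem mentioned in the quoted reply below can be illustrated with a small sketch. It assumes (my reading of PARQUET-686, not something stated in this thread) that the writer computed string min/max by comparing bytes as signed values, while a reader expects unsigned UTF-8 byte order. The names `signed_key`, `unsigned_key` and `may_contain` are hypothetical helpers, not part of Drill or the Parquet library.

```python
def signed_key(value: str) -> list:
    # Treat every UTF-8 byte as a signed 8-bit integer (assumed writer behaviour).
    return [b - 256 if b > 127 else b for b in value.encode("utf-8")]


def unsigned_key(value: str) -> bytes:
    # Python compares bytes objects lexicographically as unsigned values.
    return value.encode("utf-8")


def may_contain(target: str, lo: str, hi: str, key) -> bool:
    # Range check a reader would do against the stored min/max statistics.
    return key(lo) <= key(target) <= key(hi)


# "\u00e9dith" starts with a non-ASCII character whose UTF-8 bytes are >= 0x80.
values = ["apple", "zebra", "\u00e9dith"]

# Statistics as a signed-comparison writer would store them.
stored_min = min(values, key=signed_key)
stored_max = max(values, key=signed_key)

# Statistics a reader using unsigned byte order would expect.
expected_min = min(values, key=unsigned_key)
expected_max = max(values, key=unsigned_key)

# The reader checks a value that IS present against the stored statistics.
apple_survives = may_contain("apple", stored_min, stored_max, unsigned_key)
```

Under these assumptions the stored and expected statistics disagree, and `apple_survives` is `False`: a reader trusting the stored min/max would skip a row group that actually holds the value. That is a wrong result, not only a missed optimization, which would explain disabling varchar pushdown altogether.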

On Wed, May 31, 2017 at 5:55 PM, Kunal Khatua <kkhatua@mapr.com> wrote:

> Even though filter pushdown is supported in Drill, it is limited to
> pushing down of numeric values including dates. We do not support pushdown
> of varchar because of this bug in the parquet library:
>
> https://issues.apache.org/jira/browse/PARQUET-686
>
> This comparison-correctness issue is what makes the min/max statistics
> written by the Parquet library unreliable for varchar columns.
>
>
> ________________________________
> From: Stefán Baxter <stefan@activitystream.com>
> Sent: Monday, May 29, 2017 1:41:30 PM
> To: user
> Subject: Parquet filter pushdown and string fields that use dictionary
> encoding
>
> Hi,
>
> I would like to verify that my understanding of parquet filter pushdown in
> Drill (https://drill.apache.org/docs/parquet-filter-pushdown/) is correct.
>
> Is it correctly understood that Drill does not support predicate push-down
> for string fields when dictionary based string encoding is enabled?  (It
> looks like Presto can do this.)
>
> We save a lot of space using dictionary encoding (not enabled in Drill 1.10
> by default), and if my understanding of how it works is correct, then the
> segment dictionary could be used to determine whether a value is in a segment
> or whether the segment can be pruned/skipped when filtering on columns that
> are compressed/encoded using a dictionary.
>
> I may be misunderstanding how this works, and perhaps the dictionary is
> created for the file as a whole and not for individual sections, but I know
> that min/max values would not be good to determine the need for a segment scan.
>
> I was hoping we could use partitioning on field(s) with lower cardinality
> to create partitions for typical partition pruning and then sort the
> contents of individual fields by session/customer IDs (which include
> alphanumeric characters here) so that segments would only contain a
> relatively low number of those unique values to facilitate "segment
> pruning" when looking for data belonging to individual sessions/customers.
>
> Best regards,
>  -Stefán Baxter
>
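
The "segment pruning" idea in the original question can be sketched as follows. This is a toy model, not Drill's or Parquet's API: `RowGroup`, `prune_by_min_max` and `prune_by_dictionary` are hypothetical names, and each row group's dictionary is modelled as a plain set of the distinct column values it holds. In the Parquet format, dictionary pages are written per column chunk, that is per row group rather than per file. The sketch assumes every chunk is fully dictionary-encoded; a chunk that fell back to plain encoding could not be pruned this way.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class RowGroup:
    name: str
    dictionary: Set[str]  # distinct values of the column in this row group

    @property
    def min_value(self) -> str:
        return min(self.dictionary)

    @property
    def max_value(self) -> str:
        return max(self.dictionary)


def prune_by_min_max(groups: List[RowGroup], target: str) -> List[str]:
    # Keep row groups whose [min, max] range could contain the target.
    return [g.name for g in groups if g.min_value <= target <= g.max_value]


def prune_by_dictionary(groups: List[RowGroup], target: str) -> List[str]:
    # Keep only row groups whose dictionary actually lists the target.
    return [g.name for g in groups if target in g.dictionary]


# Made-up alphanumeric session IDs, for illustration only.
groups = [
    RowGroup("rg0", {"a1f3", "a9c2", "b7d0"}),
    RowGroup("rg1", {"b8e1", "c2a4", "d0f9"}),
    RowGroup("rg2", {"a2b2", "z9z9"}),
]

target = "c2a4"
by_min_max = prune_by_min_max(groups, target)        # range test only
by_dictionary = prune_by_dictionary(groups, target)  # exact membership
```

Here the range test still has to scan `rg2`, because its min/max span is wide even though it does not contain the value, while the dictionary test keeps only `rg1`. Sorting by session/customer ID, as proposed above, narrows each row group's range and makes both checks more selective.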
