drill-user mailing list archives

From Jinfeng Ni <...@apache.org>
Subject Re: Parquet filter pushdown and string fields that use dictionary encoding
Date Thu, 01 Jun 2017 04:46:23 GMT
Kunal is correct that Drill currently supports filter pruning at the Parquet
row group level, using min/max statistics. Such support is limited to
numeric/timestamp types, because of the potentially corrupted varchar
min/max statistics that Kunal mentioned.
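
To make the row group pruning idea concrete, here is a rough sketch in Python. It is not Drill's code: the RowGroupStats class and the can_prune_* helpers are hypothetical names made up for illustration, and it assumes the min/max statistics for a numeric column are trustworthy.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical min/max statistics for one numeric column in a row group."""
    min_value: int
    max_value: int


def can_prune_eq(stats: RowGroupStats, literal: int) -> bool:
    """True if `col = literal` cannot match any row in this row group."""
    return literal < stats.min_value or literal > stats.max_value


def can_prune_gt(stats: RowGroupStats, literal: int) -> bool:
    """True if `col > literal` cannot match any row in this row group."""
    return stats.max_value <= literal


# Three row groups with non-overlapping value ranges (illustrative data).
row_groups = [RowGroupStats(1, 100), RowGroupStats(101, 200), RowGroupStats(201, 300)]

# For the filter `col = 150`, keep only the row groups that might contain a match.
surviving = [i for i, rg in enumerate(row_groups) if not can_prune_eq(rg, 150)]
```

With this data only the second row group (index 1) survives, so the other two never need to be read. The check is only as good as the statistics, which is why it is restricted to types whose min/max can be trusted.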

For now Drill does not support dictionary-based pruning. It would be great
if someone in the community could contribute to make it happen. That
would probably require a lot of work in the Parquet reader at execution
time.
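
For anyone who wants to pick this up, the core check might look roughly like the Python sketch below. This is an illustration of the idea only, not existing Drill or Parquet API: can_prune_by_dictionary, the row group dicts and the "fully_encoded" flag are all hypothetical. The flag models the fact that a Parquet writer can fall back to plain encoding partway through a column chunk, in which case the dictionary no longer lists every value and cannot be used to prune.

```python
from typing import Iterable


def can_prune_by_dictionary(
    dictionary_values: Iterable[str],
    literal: str,
    all_pages_dictionary_encoded: bool,
) -> bool:
    """True if `col = literal` cannot match any row in this column chunk."""
    if not all_pages_dictionary_encoded:
        # Some pages are plain-encoded, so the dictionary is incomplete.
        return False
    return literal not in set(dictionary_values)


# Illustrative row groups; names and layout are assumptions for this sketch.
row_groups = [
    {"dictionary": ["cust-001", "cust-002"], "fully_encoded": True},
    {"dictionary": ["cust-003", "cust-004"], "fully_encoded": True},
    {"dictionary": ["cust-005"], "fully_encoded": False},
]

# Row groups that still have to be scanned for the filter `col = 'cust-003'`.
to_scan = [
    i
    for i, rg in enumerate(row_groups)
    if not can_prune_by_dictionary(rg["dictionary"], "cust-003", rg["fully_encoded"])
]
```

Here the first row group is skipped, the second is scanned because its dictionary contains the value, and the third is scanned because its dictionary cannot be trusted to be complete.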

On Wed, May 31, 2017 at 5:47 PM, Kunal Khatua <kkhatua@mapr.com> wrote:

>
> I might not be completely accurate, but the min-max technique lets you
> figure out whether a value matching a string-based filter could exist in
> a row group (currently, Drill doesn't check at the page level). The
> comparison might be incorrect in cases where the bytes of a text are not
> interpreted as unsigned bytes. The Parquet filter pushdown is applied by
> Drill during planning time.
>
>
> However, for dictionary-encoded fields, the Reader/Scanner would need to
> decode the Dictionary page to identify whether a filter condition's value
> is present in the subsequent data pages. This would (most likely) be done
> during execution time, and I don't believe Drill does that as yet.
>
> ________________________________
> From: Stefán Baxter <stefan@activitystream.com>
> Sent: Wednesday, May 31, 2017 5:08:23 PM
> To: user
> Subject: Re: Parquet filter pushdown and string fields that use dictionary
> encoding
>
> Thank you Kunal.
>
> Can you please explain to me why min/max values would be relevant for
> dictionary encoded fields? (I think I may be completely misunderstanding
> how they work)
>
> Regards,
>  -Stefán
>
> On Wed, May 31, 2017 at 5:55 PM, Kunal Khatua <kkhatua@mapr.com> wrote:
>
> > Even though filter pushdown is supported in Drill, it is limited to
> > pushing down numeric values, including dates. We do not support
> > pushdown of varchar because of this bug in the Parquet library:
> >
> > https://issues.apache.org/jira/browse/PARQUET-686
> >
> > That comparison-correctness issue is what makes relying on the Parquet
> > library's min-max statistics unreliable.
> >
> >
> > ________________________________
> > From: Stefán Baxter <stefan@activitystream.com>
> > Sent: Monday, May 29, 2017 1:41:30 PM
> > To: user
> > Subject: Parquet filter pushdown and string fields that use dictionary
> > encoding
> >
> > Hi,
> >
> > I would like to verify that my understanding of parquet filter pushdown
> > in Drill (https://drill.apache.org/docs/parquet-filter-pushdown/) is
> > correct.
> >
> > Is it correctly understood that Drill does not support predicate
> > push-down for string fields when dictionary based string encoding is
> > enabled?  (It looks like Presto can do this.)
> >
> > We save a lot of space using dictionary encoding (not enabled in Drill
> > 1.10 by default) and if my understanding of how it works is correct then
> > the segment dictionary could be used to determine if a value is in a
> > segment or if it can be pruned/skipped when filtering based on columns
> > that are compressed/encoded using a dictionary.
> >
> > I may be misunderstanding how this works and perhaps the dictionary is
> > created for the file as a whole and not individual sections but I know
> > that min/max values would not be good to determine the need for a
> > segment scan.
> >
> > I was hoping we could use partitioning on field(s) with lower cardinality
> > to create partitions for typical partition pruning and then sort the
> > contents of individual fields by session/customer IDs (which include
> > alphanumeric characters here) so that segments would only contain a
> > relatively low number of those unique values to facilitate "segment
> > pruning" when looking for data belonging to individual
> > sessions/customers.
> >
> > Best regards,
> >  -Stefán Baxter
> >
>
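
On the signed versus unsigned byte point quoted above, a small Python sketch of how things can go wrong. The helper names and sample values are made up for illustration, and it assumes (as I understand the issue) that the writer computed min/max by comparing bytes as signed values while the reader compares them as unsigned.

```python
def signed_key(value: str) -> list:
    """Sort key treating each UTF-8 byte as a signed value (-128..127)."""
    return [b - 256 if b > 127 else b for b in value.encode("utf-8")]


def unsigned_key(value: str) -> list:
    """Sort key treating each UTF-8 byte as an unsigned value (0..255)."""
    return list(value.encode("utf-8"))


# "\u00e9cole" starts with a non-ASCII character, so its first UTF-8 byte is > 127.
values = ["apple", "zebra", "\u00e9cole"]

# Statistics as a writer using signed byte comparison would record them.
signed_min = min(values, key=signed_key)
signed_max = max(values, key=signed_key)

# A reader comparing unsigned bytes checks the literal against those statistics.
literal = "\u00e9cole"
wrongly_pruned = (
    unsigned_key(literal) < unsigned_key(signed_min)
    or unsigned_key(literal) > unsigned_key(signed_max)
)
```

In this example the recorded max is "zebra", the literal sorts after it when compared as unsigned bytes, and so the row group would be pruned even though it contains the value. That is the correctness risk that keeps varchar out of min/max pruning.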
