drill-user mailing list archives

From Aman Sinha <amansi...@gmail.com>
Subject Re: Is there any plan for drill to support Parquet Format version 2.5 added column indexes?
Date Wed, 12 Dec 2018 19:21:39 GMT
This seems quite interesting.  Drill does row group pruning, but doing
page-level pruning based on indexes would be a big win.
Also, as you may know, Drill recently added a feature to leverage secondary
indexes in NoSQL databases [1].  However, we have to see whether
that capability applies to the Parquet index since the Parquet index is
local to each file.

Please create a JIRA and add your input into it.  Thanks.

[1] https://issues.apache.org/jira/browse/DRILL-6381

On Wed, Dec 12, 2018 at 10:30 AM Lou kevin <lou.kevinx@gmail.com> wrote:

> Hi, I am a Drill user and use Parquet as the storage format.
> I have learned that some new features have been added to the latest Parquet format.
> The new Parquet column index feature seems very attractive. Is
> there any plan to support it in Drill?
> Thanks very much!
> The feature details:
> https://github.com/apache/parquet-format/blob/master/CHANGES.md#version-250
> See https://issues.apache.org/jira/browse/PARQUET-1201
> And the goals: make both range scans and point lookups I/O efficient by
> allowing direct access to pages based on their min and max values. In
> particular:
> 1. A single-row lookup in a rowgroup based on the sort column of that
> rowgroup will only read one data page per retrieved column. Range scans on
> the sort column will only need to read the exact data pages that contain
> relevant data.
> 2. Make other selective scans I/O efficient: if we have a very selective
> predicate on a non-sorting column, for the other retrieved columns we
> should only need to access data pages that contain matching rows.
> 3. No additional decoding effort for scans without selective predicates,
> e.g., full-row group scans. If a reader determines that it does not need to
> read the index data, it does not incur any overhead.
> 4. Index pages for sorted columns use minimal storage by storing only the
> boundary elements between pages.
> boundary elements between pages.
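
To make the quoted goals concrete, here is a minimal sketch of page-level pruning from per-page min/max values. It is not Drill or parquet-mr code: `PageStats`, `pages_to_read`, and `pages_to_read_sorted` are hypothetical names, and the integer page statistics are made up for illustration.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page statistics, as a column index might expose them."""
    min_value: int
    max_value: int


def pages_to_read(pages, lo, hi):
    """Return indexes of pages whose [min, max] range overlaps [lo, hi].

    Works for any column, sorted or not: every page entry is checked.
    """
    return [i for i, p in enumerate(pages)
            if p.max_value >= lo and p.min_value <= hi]


def pages_to_read_sorted(pages, lo, hi):
    """Same result for a sort column (pages in ascending order).

    Binary search on the page boundaries replaces the linear check.
    """
    maxes = [p.max_value for p in pages]
    mins = [p.min_value for p in pages]
    first = bisect_left(maxes, lo)   # first page with max >= lo
    last = bisect_right(mins, hi)    # one past the last page with min <= hi
    return list(range(first, last))


# Made-up statistics for four pages of one sorted column.
pages = [PageStats(0, 9), PageStats(10, 19), PageStats(20, 29), PageStats(30, 39)]
point_lookup = pages_to_read_sorted(pages, 15, 15)   # single-value lookup
range_scan = pages_to_read_sorted(pages, 18, 22)     # range scan
```

With these sample statistics the point lookup touches a single page and the range scan touches two, which is the I/O saving that goals 1 and 2 describe. A reader with no selective predicate would skip both functions entirely, matching goal 3.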
