drill-user mailing list archives

From Jinfeng Ni <...@apache.org>
Subject Re: Best way to partition the data
Date Fri, 01 Sep 2017 17:28:56 GMT
If your partitioning column has low cardinality, yet you still end up
with 50k small files, you probably have many parallel writer
minor fragments (threads). By default, each writer minor fragment
works independently. With cardinality C and N writer minor fragments,
you can end up with up to C*N small files.

There are two possible solutions.

1) Consider setting the following option to true. It adds network
communication and CPU cost, but it reduces the number of files to C.

alter session set `store.partition.hash_distribute` = true;   //default is

2) Reduce the number of parallel writer minor fragments by tuning other
parameters before you run the CTAS partition statement.
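
To make the arithmetic behind the two options concrete, here is a rough
sketch in Python. It only restates the C*N bound and the effect of hash
distribution described above; the function name and the sample numbers are
made up for illustration and are not measurements from any real cluster.

```python
def max_output_files(cardinality: int, writer_fragments: int,
                     hash_distribute: bool = False) -> int:
    """Upper bound on the files a partitioned CTAS can write.

    cardinality      -- C, distinct values of the partitioning column(s)
    writer_fragments -- N, parallel writer minor fragments
    hash_distribute  -- True models store.partition.hash_distribute = true
    """
    if cardinality < 1 or writer_fragments < 1:
        raise ValueError("cardinality and writer_fragments must be >= 1")
    # With hash distribution, the message above says the count drops to C.
    if hash_distribute:
        return cardinality
    # Otherwise every writer may emit one file per partition value it sees.
    return cardinality * writer_fragments


# Hypothetical numbers, not taken from the original poster's setup.
print(max_output_files(500, 100))                        # 50000
print(max_output_files(500, 100, hash_distribute=True))  # 500
print(max_output_files(500, 10))                         # 5000
```

The last call corresponds to option 2: with fewer writer fragments the
bound shrinks proportionally, without the extra network shuffle.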

For partition pruning, Drill works at the row group level, not at the page level.
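
As an illustration of what row-group-level pruning means, the sketch below
keeps only the row groups whose min/max range can contain the filter value.
This is a simplified model in Python, not Drill's planner code; the
RowGroupStats type, the file names, and the values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RowGroupStats:
    """Min/max statistics of one column for one row group (hypothetical type)."""
    path: str
    min_value: int
    max_value: int


def row_groups_to_read(stats: List[RowGroupStats], wanted: int) -> List[str]:
    """Return the row groups that may contain rows where column == wanted."""
    # A row group can be skipped only when wanted lies outside [min, max].
    return [s.path for s in stats if s.min_value <= wanted <= s.max_value]


stats = [
    RowGroupStats("a.parquet", 1, 1),  # holds a single partition value
    RowGroupStats("b.parquet", 2, 2),  # holds a single partition value
    RowGroupStats("c.parquet", 1, 3),  # mixed values, cannot be pruned for 2
]
print(row_groups_to_read(stats, 2))  # ['b.parquet', 'c.parquet']
```

The mixed row group shows why each row group should hold a single
partition value: a wide min/max range defeats pruning.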

On Fri, Sep 1, 2017 at 9:02 AM, Padma Penumarthy <ppenumarthy@mapr.com> wrote:

> Have you tried building the metadata cache file using the "refresh table
> metadata" command? That will help reduce the planning time. Is most of
> the time spent in planning or in execution?
>
> Pruning is done at the row group level, i.e. at the file level (we create
> one file per row group). We do not support pruning at the page level.
>
> I am thinking that if it created 50K files, your cardinality is high. You
> might want to consider putting some directory hierarchy in place; for
> example, you can create a directory for each unique value of column 1 and
> a file for each unique value of column 2 underneath.
>
> If partitioning is done correctly, then depending on the filters, we
> should not read more row groups than needed.
> Thanks,
> Padma
> On Sep 1, 2017, at 6:54 AM, Damien Profeta <damien.profeta@amadeus.com> wrote:
> Hello,
> I have a dataset that I always query on 2 columns that don't have a big
> cardinality. To benefit from pruning, I tried to partition the files on
> these keys, but I end up with 50k different small files (30 MB), and
> queries on them spend most of their time in the planning phase, decoding
> the metadata file, resolving the absolute paths…
> By looking at the Parquet file structure, I saw that there are statistics
> at page level and chunk level. So I tried to generate Parquet files where a
> page is dedicated to one value of the 2 partition columns. By using the
> statistics, Drill could drop the page/chunk.
> But it seems Drill is not making any use of the statistics in the Parquet
> file because, whatever query I run, I don't see any change in the number
> of pages loaded.
> Can you confirm my conclusion? What would be the best way to organize the
> data so that Drill doesn't read data that can be pruned easily?
> Thanks
> Damien
