drill-user mailing list archives

From Padma Penumarthy <ppenumar...@mapr.com>
Subject Re: Best way to partition the data
Date Fri, 01 Sep 2017 16:02:38 GMT
Have you tried building the metadata cache file using the "refresh table metadata" command?
That will help reduce the planning time. Is most of the time spent in planning or in execution?
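For reference, the command takes the table's path as an argument; the path below is illustrative:

```sql
-- Builds the Parquet metadata cache file(s) for the table directory,
-- so the planner can read one cache file instead of every Parquet footer.
REFRESH TABLE METADATA dfs.`/data/mytable`;
```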

Pruning is done at the rowgroup level, i.e. at the file level (we create one file per rowgroup).
We do not support pruning at the page level.
If partitioning created 50K files, your cardinality is high. You might want to
consider putting a directory hierarchy in place: for example, create a directory
for each unique value of column 1 and, underneath it, a file for each unique value of column 2.
If partitioning is done correctly, then depending upon the filters, we should not read more
rowgroups than needed.
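As a sketch of how such a layout is queried: with one directory per value of column 1, Drill exposes the directory level as the implicit dir0 column, so a filter on it prunes whole directories before any file is read (the table path and value below are illustrative):

```sql
-- Layout assumed: /data/mytable/<col1 value>/<col2 value>.parquet
-- dir0 is Drill's implicit column for the first directory level,
-- so this filter restricts the scan to files under /data/mytable/US/.
SELECT *
FROM dfs.`/data/mytable`
WHERE dir0 = 'US';
```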


On Sep 1, 2017, at 6:54 AM, Damien Profeta <damien.profeta@amadeus.com> wrote:


I have a dataset that I always query on 2 columns that don't have high cardinality. So to
benefit from pruning, I tried to partition the files on these keys, but I ended up with 50K different
small files (30 MB each), and queries on it spend most of their time in the planning phase, decoding
the metadata file, resolving the absolute paths, and so on.

Looking at the Parquet file structure, I saw that there are statistics at the page level and
chunk level. So I tried to generate Parquet files where each page is dedicated to one value
of the 2 partition columns. By using those statistics, Drill should be able to skip pages/chunks.
But it seems Drill is not making any use of the statistics in the Parquet file, because whatever
query I run, I see no change in the number of pages loaded.

Can you confirm my conclusion? What would be the best way to organize the data so that Drill
doesn't read data that can easily be pruned?

