drill-user mailing list archives

From Damien Profeta <damien.prof...@amadeus.com>
Subject Best way to partition the data
Date Fri, 01 Sep 2017 13:54:11 GMT

I have a dataset that I always query on 2 columns that don't have a big
cardinality. So, to benefit from pruning, I tried to partition the files
on these keys, but I ended up with 50k different small files (~30 MB each),
and queries on it spend most of their time in the planning phase: decoding
the metadata file, resolving the absolute paths…
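The file explosion follows directly from the product of the two cardinalities. A minimal sketch with hypothetical per-column cardinalities (the real values aren't stated in the message):

```python
# Hypothetical cardinalities for the two partition columns; neither is
# large on its own, but partitioning on both multiplies them.
card_a = 250
card_b = 200

partitions = card_a * card_b          # one directory/file per combination
total_size_gb = partitions * 30 / 1024  # at ~30 MB per file

print(partitions)      # 50000 small files
print(total_size_gb)   # ~1.5 TB of data spread over them
```

So even modest per-column cardinalities push the planner into tracking tens of thousands of files' worth of metadata.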

By looking at the Parquet file structure, I saw that there are
statistics at the page level and the chunk level. So I tried to generate
Parquet files where each page is dedicated to one value of the 2 partition
columns. By using the statistics, Drill should be able to drop the page/chunk.
But it seems Drill is not making any use of the statistics in the
Parquet files because, whatever query I run, I don't see any change in
the number of pages loaded.
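The behaviour I was expecting from those statistics is standard min/max pruning: skip any page or chunk whose [min, max] range cannot contain the predicate value. A minimal pure-Python sketch of the idea, with made-up chunk names and values:

```python
# Hypothetical chunk-level metadata, as a reader might collect it from
# the Parquet footer: one [min, max] range per column chunk.
chunks = [
    {"path": "part-0", "min": "AAA", "max": "LHR"},
    {"path": "part-1", "min": "LIS", "max": "ZRH"},
]

def prune(chunks, value):
    """Keep only chunks whose statistics range could contain `value`."""
    return [c["path"] for c in chunks if c["min"] <= value <= c["max"]]

# For an equality predicate on "NCE", only the second chunk needs reading.
print(prune(chunks, "NCE"))  # ['part-1']
```

If Drill applied this check to the page/chunk statistics I generated, the pages loaded would vary with the predicate, which is not what I observe.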

Can you confirm my conclusion? What would be the best way to organize the
data so that Drill doesn't read the data that can easily be pruned?

