drill-user mailing list archives

From Damien Profeta <damien.prof...@amadeus.com>
Subject Best way to partition the data
Date Fri, 01 Sep 2017 13:54:11 GMT

I have a dataset that I always query on 2 columns that don't have a big
cardinality. So, to benefit from pruning, I tried to partition the files
on these keys, but I ended up with 50k different small files (~30 MB each),
and queries on it spend most of their time in the planning phase: decoding
the metadata file, resolving the absolute paths…
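The file explosion follows directly from the product of the two cardinalities. A minimal sketch with hypothetical per-column cardinalities (the real values aren't stated in the message):

```python
# Hypothetical cardinalities for the two partition columns; neither is
# large on its own, but partitioning on both multiplies them.
card_a = 250
card_b = 200

partitions = card_a * card_b          # one directory/file per combination
total_size_gb = partitions * 30 / 1024  # at ~30 MB per file

print(partitions)      # 50000 small files
print(total_size_gb)   # ~1.5 TB of data spread over them
```

So even modest per-column cardinalities push the planner into tracking tens of thousands of files' worth of metadata.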

By looking at the Parquet file structure, I saw that there are
statistics at the page level and the chunk level. So I tried to generate
Parquet files where each page is dedicated to one value of the 2 partition
columns. By using the statistics, Drill should be able to drop the page/chunk.
But it seems Drill is not making any use of the statistics in the
Parquet files because, whatever query I run, I don't see any change in
the number of pages loaded.
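The behaviour I was expecting from those statistics is standard min/max pruning: skip any page or chunk whose [min, max] range cannot contain the predicate value. A minimal pure-Python sketch of the idea, with made-up chunk names and values:

```python
# Hypothetical chunk-level metadata, as a reader might collect it from
# the Parquet footer: one [min, max] range per column chunk.
chunks = [
    {"path": "part-0", "min": "AAA", "max": "LHR"},
    {"path": "part-1", "min": "LIS", "max": "ZRH"},
]

def prune(chunks, value):
    """Keep only chunks whose statistics range could contain `value`."""
    return [c["path"] for c in chunks if c["min"] <= value <= c["max"]]

# For an equality predicate on "NCE", only the second chunk needs reading.
print(prune(chunks, "NCE"))  # ['part-1']
```

If Drill applied this check to the page/chunk statistics I generated, the pages loaded would vary with the predicate, which is not what I observe.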

Can you confirm my conclusion? What would be the best way to organize the
data so that Drill doesn't read the data that can easily be pruned?

