drill-dev mailing list archives

From Aman Sinha <amansi...@apache.org>
Subject [DISCUSS] Ideas to improve metadata cache read performance
Date Sun, 25 Oct 2015 16:28:36 GMT
I was going to file an enhancement JIRA but thought I would discuss it here first.

The parquet metadata cache file is a JSON file that contains a subset of
the metadata extracted from the parquet files.  The cache file can get
really large .. a few GBs for a few hundred thousand files.
I have filed a separate JIRA, DRILL-3973, for profiling the various aspects
of planning, including metadata operations.  In the meantime, the timestamps
in the drillbit.log output indicate a large chunk of time is spent creating
the drill table to begin with, which points to a bottleneck in reading the
metadata.  (I can provide performance numbers later once we confirm through
profiling.)

A few thoughts around improvements:
 - The jackson deserialization of the JSON file is very slow .. can this be
sped up?  For instance, the Afterburner module for jackson claims to
improve performance by 30-40% by avoiding the use of reflection.
 - Reading the cache file is a single-threaded process.  If we were reading
directly from the parquet files, we would use a default of 16 threads.  What
can be done to parallelize the read?
 - Is there any operation that could be done one time, during the REFRESH
METADATA command?  For instance, examining the min/max values to determine
whether a partition column is single-valued could be eliminated if we do this
computation during the REFRESH METADATA command and store the summary once.

 - A pertinent question is: should the cache file be stored in a more
efficient format such as Parquet instead of JSON?
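The second idea above could be sketched roughly as follows. This is a minimal, self-contained illustration only: all names (`ParallelCacheRead`, `parseShard`, `READ_THREADS`) are hypothetical, and the per-shard parse is a stand-in where a real implementation would run the Jackson deserializer over each shard's portion of the cache file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCacheRead {
    // Matches the default parallelism used when reading parquet files directly.
    static final int READ_THREADS = 16;

    // Hypothetical: deserialize one shard of per-file metadata entries.
    // A real implementation would run Jackson over this shard's byte range
    // or its own slice of JSON entries; here we just trim strings.
    static List<String> parseShard(List<String> rawEntries) {
        List<String> parsed = new ArrayList<>();
        for (String e : rawEntries) {
            parsed.add(e.trim());
        }
        return parsed;
    }

    // Split the entries into shards, parse them concurrently, and
    // reassemble the results in submission order.
    static List<String> parseAll(List<String> rawEntries) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(READ_THREADS);
        try {
            int shardSize =
                Math.max(1, (rawEntries.size() + READ_THREADS - 1) / READ_THREADS);
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int i = 0; i < rawEntries.size(); i += shardSize) {
                List<String> shard =
                    rawEntries.subList(i, Math.min(i + shardSize, rawEntries.size()));
                futures.add(pool.submit(() -> parseShard(shard)));
            }
            List<String> result = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                result.addAll(f.get()); // blocks; preserves shard order
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

One caveat with this approach: a single multi-GB JSON document does not split naturally on byte boundaries, so parallelizing the read probably also means changing the cache layout (e.g. one entry per line, or multiple cache files) so that shards can be located without a full sequential scan.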

