drill-dev mailing list archives

From Aman Sinha <amansi...@apache.org>
Subject Re: [DISCUSS] Ideas to improve metadata cache read performance
Date Sun, 25 Oct 2015 16:33:34 GMT
Forgot to include the link for Jackson's AfterBurner module:

On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha <amansinha@apache.org> wrote:

> I was going to file an enhancement JIRA but thought I would discuss here
> first:
> The parquet metadata cache file is a JSON file that contains a subset of
> the metadata extracted from the parquet files.  The cache file can get
> really large: a few GB for a few hundred thousand files.
> I have filed a separate JIRA, DRILL-3973, for profiling the various aspects
> of planning, including metadata operations.  In the meantime, the timestamps
> in the drillbit.log output indicate a large chunk of time spent in creating
> the drill table to begin with, which points to a bottleneck in reading the
> metadata.  (I can provide performance numbers later once we confirm through
> profiling.)
> A few thoughts around improvements:
>  - The Jackson deserialization of the JSON file is very slow; can it be
> sped up?  For instance, the AfterBurner module of Jackson claims to
> improve performance by 30-40% by avoiding the use of reflection.
>  - The cache file read is a single-threaded process.  If we were reading
> directly from the parquet files, we would use a default of 16 threads.
> What can be done to parallelize the read?
>  - Are there operations that could be done one time during the REFRESH
> METADATA command?  For instance, examining the min/max values to determine
> whether a partition column has a single value could be eliminated if we did
> this computation during REFRESH METADATA and stored the summary once.
>  - A pertinent question: should the cache file be stored in a more
> efficient format, such as Parquet, instead of JSON?
> Aman
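For reference, wiring the AfterBurner module in is a one-line change to the ObjectMapper setup.  A minimal sketch (the FileMeta POJO here is illustrative, not Drill's actual metadata class):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.module.afterburner.AfterburnerModule;

public class AfterburnerDemo {
    // Illustrative stand-in for a per-file metadata entry; Drill's real
    // cache classes have many more fields.
    public static class FileMeta {
        public String path;
        public long rowCount;
    }

    public static void main(String[] args) throws Exception {
        // Registering AfterburnerModule replaces much of Jackson's
        // reflection-based (de)serialization with generated bytecode,
        // which is where the claimed 30-40% speedup comes from.
        ObjectMapper mapper = new ObjectMapper()
                .registerModule(new AfterburnerModule());
        FileMeta meta = mapper.readValue(
                "{\"path\":\"/tmp/f1.parquet\",\"rowCount\":42}", FileMeta.class);
        System.out.println(meta.path + " " + meta.rowCount);
    }
}
```

This requires the jackson-module-afterburner artifact on the classpath; the rest of the deserialization code is unchanged.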
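On the parallel-read idea: one shape this could take is splitting the per-file metadata entries across a fixed-size thread pool, mirroring the 16-way default used for direct parquet reads.  A minimal stdlib-only sketch under those assumptions (parseEntry is a hypothetical stand-in for the per-entry JSON deserialization):

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelCacheRead {
    // Matches the default parallelism mentioned for direct parquet reads.
    static final int THREADS = 16;

    // Fan the raw entries out to the pool, then collect results in order.
    static List<String> parseAll(List<String> rawEntries) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String raw : rawEntries) {
                futures.add(pool.submit(() -> parseEntry(raw)));
            }
            List<String> parsed = new ArrayList<>();
            for (Future<String> f : futures) {
                parsed.add(f.get());
            }
            return parsed;
        } finally {
            pool.shutdown();
        }
    }

    // Hypothetical stand-in for deserializing one per-file metadata entry.
    static String parseEntry(String raw) {
        return raw.trim();
    }

    public static void main(String[] args) throws Exception {
        List<String> out = parseAll(Arrays.asList(" a ", " b "));
        System.out.println(out);
    }
}
```

The catch is that the cache is currently one monolithic JSON document, so the file would first need to be split into independently parseable chunks (or written that way to begin with) before this kind of fan-out applies.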
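And on precomputing the single-value determination during REFRESH METADATA: the per-file min/max statistics could be folded into a one-time summary, so planning only consults the summary instead of re-scanning every file's statistics.  A sketch of that fold, with illustrative names and long-typed stats for simplicity:

```java
import java.util.*;

public class PartitionSummary {
    // Merge per-file [min, max] pairs into a global [min, max] per column,
    // then flag columns where min == max (single-valued, i.e. usable as a
    // partition column without further per-file examination).
    static Map<String, Boolean> singleValueColumns(
            List<Map<String, long[]>> perFileMinMax) {
        Map<String, long[]> merged = new HashMap<>();
        for (Map<String, long[]> file : perFileMinMax) {
            for (Map.Entry<String, long[]> e : file.entrySet()) {
                merged.merge(e.getKey(), e.getValue().clone(), (a, b) ->
                        new long[]{Math.min(a[0], b[0]), Math.max(a[1], b[1])});
            }
        }
        Map<String, Boolean> single = new HashMap<>();
        merged.forEach((col, mm) -> single.put(col, mm[0] == mm[1]));
        return single;
    }

    public static void main(String[] args) {
        // Two files: "dir0" has the same value everywhere, "id" does not.
        List<Map<String, long[]>> files = Arrays.asList(
                Map.of("dir0", new long[]{5, 5}, "id", new long[]{1, 10}),
                Map.of("dir0", new long[]{5, 5}, "id", new long[]{11, 20}));
        System.out.println(singleValueColumns(files));
    }
}
```

Storing the resulting map in the cache file at REFRESH METADATA time would make it a constant-size lookup at plan time, regardless of file count.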
