drill-dev mailing list archives

From Hanifi Gunes <hgu...@maprtech.com>
Subject Re: [DISCUSS] Ideas to improve metadata cache read performance
Date Mon, 26 Oct 2015 21:10:53 GMT
I am not familiar with the contents of the stored metadata, but if the
deserialization workload fits any of Afterburner's claimed improvement
points [1], it could well be worth trying, given that the claimed
throughput gain is substantial.
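For what it's worth, wiring Afterburner in is a one-line change wherever the
cache-reading ObjectMapper is constructed. A minimal sketch, with
ParquetTableMetadata standing in for whatever class Drill maps the cache
JSON onto:

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.module.afterburner.AfterburnerModule;

// Afterburner swaps Jackson's reflection-based property access for
// bytecode-generated accessors; nothing else about the mapper changes.
ObjectMapper mapper = new ObjectMapper();
mapper.registerModule(new AfterburnerModule());

// Deserialization then proceeds exactly as before, e.g.:
// ParquetTableMetadata metadata =
//     mapper.readValue(cacheFile, ParquetTableMetadata.class);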

It could also be a good idea to partition the cache over a number of files
for better parallelization, given that the number of cache files generated
is *significantly* smaller than the number of parquet files (a sketch
follows below). Maintaining global statistics seems like an improvement
point too.
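As a rough illustration of the partitioned-cache idea: a minimal sketch,
assuming hypothetical per-shard cache files and a CacheShard stand-in type,
where each shard is deserialized on its own thread and merged afterwards
(ObjectMapper is thread-safe for reads, so one instance can be shared):

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import com.fasterxml.jackson.databind.ObjectMapper;

public class PartitionedCacheReader {

  // Stand-in for whatever per-shard metadata type Drill would use.
  static class CacheShard { /* deserialized metadata for one shard */ }

  public static List<CacheShard> readAll(List<File> shardFiles, int threads)
      throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      // Submit one deserialization task per shard file.
      List<Future<CacheShard>> futures = new ArrayList<>();
      for (File f : shardFiles) {
        futures.add(pool.submit(() -> mapper.readValue(f, CacheShard.class)));
      }
      // Collect results, propagating any deserialization failure.
      List<CacheShard> shards = new ArrayList<>();
      for (Future<CacheShard> fut : futures) {
        shards.add(fut.get());
      }
      return shards;
    } finally {
      pool.shutdown();
    }
  }
}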


-H+

1: https://github.com/FasterXML/jackson-module-afterburner#what-is-optimized

On Sun, Oct 25, 2015 at 9:33 AM, Aman Sinha <amansinha@apache.org> wrote:

> Forgot to include the link for Jackson's AfterBurner module:
>   https://github.com/FasterXML/jackson-module-afterburner
>
> On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha <amansinha@apache.org> wrote:
>
> > I was going to file an enhancement JIRA but thought I would discuss here
> > first:
> >
> > The parquet metadata cache file is a JSON file that contains a subset of
> > the metadata extracted from the parquet files.  The cache file can get
> > really large: a few GBs for a few hundred thousand files.
> > I have filed a separate JIRA, DRILL-3973, for profiling the various
> > aspects of planning, including metadata operations.  In the meantime, the
> > timestamps in the drillbit.log output indicate a large chunk of time spent
> > in creating the drill table to begin with, which points to a bottleneck in
> > reading the metadata.  (I can provide performance numbers later, once we
> > confirm through profiling.)
> >
> > A few thoughts around improvements:
> >  - The jackson deserialization of the JSON file is very slow.  Can this be
> > sped up?  For instance, the AfterBurner module of jackson claims to
> > improve performance by 30-40% by avoiding the use of reflection.
> >  - The cache file read is a single-threaded process.  If we were reading
> > directly from parquet files, we would use a default of 16 threads.  What
> > can be done to parallelize the read?
> >  - Is there any operation that could be done once during the REFRESH
> > METADATA command?  For instance, examining the min/max values to determine
> > whether a partition column is single-valued could be eliminated if we did
> > this computation during the REFRESH METADATA command and stored the
> > summary one time (see the sketch after the quoted message).
> >
> >  - A pertinent question is: should the cache file be stored in a more
> > efficient format, such as Parquet, instead of JSON?
> >
> > Aman
> >
> >
>
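The one-time computation in the third bullet above could look roughly like
the following: a minimal sketch, assuming a hypothetical
PartitionColumnSummary helper that REFRESH METADATA feeds with each row
group's per-column min/max as it scans the parquet footers. The class name
and the observe() callback are illustrative, not Drill's actual API:

import java.util.HashMap;
import java.util.Map;

public class PartitionColumnSummary {

  // Maps column name -> the single value seen so far, or NOT_SINGLE once
  // two distinct values have been observed.
  private final Map<String, Object> singleValues = new HashMap<>();
  private static final Object NOT_SINGLE = new Object();

  /** Called once per column per row group with that row group's min/max. */
  public void observe(String column, Object min, Object max) {
    if (min == null || !min.equals(max)) {
      singleValues.put(column, NOT_SINGLE);   // varies within the row group
      return;
    }
    Object prev = singleValues.putIfAbsent(column, min);
    if (prev != null && prev != NOT_SINGLE && !prev.equals(min)) {
      singleValues.put(column, NOT_SINGLE);   // varies across row groups
    }
  }

  /** True if every observed row group carried the same single value. */
  public boolean isSingleValue(String column) {
    Object v = singleValues.get(column);
    return v != null && v != NOT_SINGLE;
  }
}

Planning could then consult isSingleValue() from the stored summary instead
of re-examining every file's min/max pairs at read time.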
