drill-dev mailing list archives

From Aman Sinha <amansi...@apache.org>
Subject Re: [DISCUSS] Ideas to improve metadata cache read performance
Date Thu, 29 Oct 2015 21:21:43 GMT
I am not so sure...as a user I would want to know through an EXPLAIN plan
exactly what will be executed...including how many files (or partitions)
would be accessed by the query.  This also determines the cardinality of
the table, which is needed for planning.  Are we talking about embedding a
mini-execution phase within the planning phase to leverage the benefits of
the execution engine?

On Thu, Oct 29, 2015 at 1:34 PM, Jacques Nadeau <jacques@dremio.com> wrote:

> Agree with Steven here. Pruning information could be added to the query
> profile for post-execution verification.
> On Oct 29, 2015 11:33 AM, "Steven Phillips" <steven@dremio.com> wrote:
>
> > I agree that this would present a small challenge for testing, but I don't
> > think ease of testing should be the primary motivator in designing the
> > software. Once we've decided what we want the software to do, then we can
> > work together to figure out how to test it.
> >
> > On Thu, Oct 29, 2015 at 11:09 AM, rahul challapalli <
> > challapallirahul@gmail.com> wrote:
> >
> > > @steven If we end up pushing the partition pruning to the execution phase,
> > > how would we know that partition pruning even took place? I am thinking
> > > from the standpoint of adding functional tests around partition pruning.
> > >
> > > - Rahul
> > >
> > > On Wed, Oct 28, 2015 at 10:53 AM, Parth Chandra <parthc@apache.org> wrote:
> > >
> > > > And ideally, I suppose, the merged schema would correspond to the
> > > > information that we want to keep in a .drill file.
> > > >
> > > >
> > > > On Tue, Oct 27, 2015 at 4:55 PM, Aman Sinha <asinha@maprtech.com> wrote:
> > > >
> > > > > @Steven, w.r.t. your suggestion about doing the metadata operation during
> > > > > the execution phase, see the related discussion in DRILL-3838.
> > > > >
> > > > > A couple more thoughts:
> > > > >  - Parth and I were discussing keeping track of the merged schema as part
> > > > > of the refresh metadata and storing the merged schema once for all files that
> > > > > have the identical schema (currently this is repeated and is a huge
> > > > > contributor to the size of the file).   To Jacques' point about keeping
> > > > > the minimum information needed for planning purposes, we certainly could do a
> > > > > better job in keeping it lean.   The row count of the table could be
> > > > > computed at the time of running the refresh metadata command.  Similarly, the
> > > > > single-value analysis can be done at that time instead of on a per-query
> > > > > basis.
> > > > >
> > > > >  - We should revisit DRILL-2517
> > > > > (https://issues.apache.org/jira/browse/DRILL-2517).
> > > > >   Consider the following 2 queries and their total elapsed times against a
> > > > > table with 310000 files:
> > > > >     (A) SELECT  count(*) FROM table WHERE `date` = '2015-07-01';
> > > > >           elapsed time: 980 secs
> > > > >
> > > > >     (B) SELECT count(*) FROM  `table/20150701` ;
> > > > >           elapsed time: 54 secs
> > > > >
> > > > >     From the user perspective, both queries should perform nearly the same,
> > > > > which was essentially the intent of DRILL-2517.
> > > > >
> > > > >
> > > > > On Tue, Oct 27, 2015 at 12:04 PM, Steven Phillips <steven@dremio.com> wrote:
> > > > >
> > > > > > I think we need to come up with a way to push partition pruning to
> > > > > > execution time.  The other solutions may relieve the problem in some cases,
> > > > > > but won't solve the fundamental problem.
> > > > > >
> > > > > > For example, even if we do figure out how to use multiple threads for
> > > > > > reading the metadata, that may be fine for a couple hundred thousand files,
> > > > > > but what about when we have millions or tens of millions of files? It will
> > > > > > still be a huge bottleneck.
> > > > > >
> > > > > > I actually think we should use the Drill execution engine to probe the
> > > > > > metadata and generate the work assignments. We could have an additional
> > > > > > fragment or fragments of the query that would recursively probe the
> > > > > > filesystem, read the metadata, and make assignments, and then pipe the
> > > > > > results into the Scanners, which will create readers on the fly. This way
> > > > > > the query could actually begin doing work before the metadata has even been
> > > > > > fully read.
> > > > > >
> > > > > > On Mon, Oct 26, 2015 at 2:42 PM, Jacques Nadeau <jacques@dremio.com> wrote:
> > > > > >
> > > > > > > My first thought is we've gotten too generous in what we're storing in the
> > > > > > > Parquet metadata file. Early implementations were very lean and it seems
> > > > > > > far larger today. For example, early implementations didn't keep statistics
> > > > > > > and ignored row groups (files, schema and block locations only). If we need
> > > > > > > multiple levels of information, we may want to stagger (or normalize) them
> > > > > > > in the file. Also, we may think about what is the minimum that must be done
> > > > > > > in planning. We could do the file pruning at execution time rather than
> > > > > > > single-tracking these things (makes stats harder though).
> > > > > > >
> > > > > > > I also think we should be cautious around jumping to a conclusion until
> > > > > > > DRILL-3973 provides more insight.
> > > > > > >
> > > > > > > In terms of caching, I'd be more inclined to rely on file system caching
> > > > > > > and make sure serialization/deserialization is as efficient as possible as
> > > > > > > opposed to implementing an application-level cache. (We already have enough
> > > > > > > problems managing memory without having to figure out when we should drop a
> > > > > > > metadata cache :D).
> > > > > > >
> > > > > > > Aside, I always liked this post for entertainment and the thoughts on
> > > > > > > virtual memory: https://www.varnish-cache.org/trac/wiki/ArchitectNotes
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jacques Nadeau
> > > > > > > CTO and Co-Founder, Dremio
> > > > > > >
> > > > > > > On Mon, Oct 26, 2015 at 2:25 PM, Hanifi Gunes <hgunes@maprtech.com> wrote:
> > > > > > >
> > > > > > > > One more thing, for workloads running queries over subsets of the same parquet
> > > > > > > > files, we can consider maintaining an in-memory cache as well, assuming the
> > > > > > > > metadata memory footprint per file is low and parquet files are static, so
> > > > > > > > that we do not need to invalidate the cache often.
> > > > > > > >
> > > > > > > > H+
> > > > > > > >
> > > > > > > > On Mon, Oct 26, 2015 at 2:10 PM, Hanifi Gunes <hgunes@maprtech.com> wrote:
> > > > > > > >
> > > > > > > > > I am not familiar with the contents of the metadata stored, but if the
> > > > > > > > > deserialization workload fits any of afterburner's claimed improvement
> > > > > > > > > points [1], it could well be worth trying, given that the claimed gain in
> > > > > > > > > throughput is substantial.
> > > > > > > > >
> > > > > > > > > It could also be a good idea to partition caching over a number of files
> > > > > > > > > for better parallelization, given that the number of cache files generated
> > > > > > > > > is *significantly* less than the number of parquet files. Maintaining global
> > > > > > > > > statistics seems an improvement point too.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > -H+
> > > > > > > > >
> > > > > > > > > 1: https://github.com/FasterXML/jackson-module-afterburner#what-is-optimized
> > > > > > > > >
> > > > > > > > > On Sun, Oct 25, 2015 at 9:33 AM, Aman Sinha <amansinha@apache.org> wrote:
> > > > > > > > >
> > > > > > > > >> Forgot to include the link for Jackson's AfterBurner module:
> > > > > > > > >>   https://github.com/FasterXML/jackson-module-afterburner
> > > > > > > > >>
> > > > > > > > >> On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha <amansinha@apache.org> wrote:
> > > > > > > > >>
> > > > > > > > >> > I was going to file an enhancement JIRA but thought I would discuss it
> > > > > > > > >> > here first:
> > > > > > > > >> >
> > > > > > > > >> > The parquet metadata cache file is a JSON file that contains a subset of
> > > > > > > > >> > the metadata extracted from the parquet files.  The cache file can get
> > > > > > > > >> > really large .. a few GBs for a few hundred thousand files.
> > > > > > > > >> > I have filed a separate JIRA: DRILL-3973 for profiling the various aspects
> > > > > > > > >> > of planning including metadata operations.  In the meantime, the timestamps
> > > > > > > > >> > in the drillbit.log output indicate a large chunk of time spent in creating
> > > > > > > > >> > the drill table to begin with, which indicates a bottleneck in reading the
> > > > > > > > >> > metadata.  (I can provide performance numbers later once we confirm through
> > > > > > > > >> > profiling).
> > > > > > > > >> >
> > > > > > > > >> > A few thoughts around improvements:
> > > > > > > > >> >  - The jackson deserialization of the JSON file is very slow.. can this be
> > > > > > > > >> > sped up ? .. for instance the AfterBurner module of jackson claims to
> > > > > > > > >> > improve performance by 30-40% by avoiding the use of reflection.
> > > > > > > > >> >  - The cache file read is a single threaded process.  If we were directly
> > > > > > > > >> > reading from parquet files, we would use a default of 16 threads.  What can
> > > > > > > > >> > be done to parallelize the read ?
> > > > > > > > >> >  - Any operation that can be done one time during the REFRESH METADATA
> > > > > > > > >> > command ?  for instance..examining the min/max values to determine
> > > > > > > > >> > single-value for a partition column could be eliminated if we do this
> > > > > > > > >> > computation during the REFRESH METADATA command and store the summary one
> > > > > > > > >> > time.
> > > > > > > > >> >
> > > > > > > > >> >  - A pertinent question is: should the cache file be stored in a more
> > > > > > > > >> > efficient format such as Parquet instead of JSON ?
> > > > > > > > >> >
> > > > > > > > >> > Aman
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
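
The refresh-time ideas raised in the thread (store the merged schema once for
files that share it, compute the table row count, and flag single-value
columns from min/max) can be illustrated with a small Python sketch. The
dictionary layout, the field names and the `summarize` function are
assumptions made for illustration only; they are not Drill's actual cache
format or API.

```python
def summarize(files):
    """Build a compact table summary from per-file metadata.

    The input layout is hypothetical: each entry has a path, a schema given as
    (name, type) pairs, a row count and per-column (min, max) statistics.
    """
    schemas = []      # each distinct schema is stored exactly once
    schema_ids = {}   # schema as a hashable tuple -> index into `schemas`
    entries = []
    total_rows = 0
    for f in files:
        key = tuple(tuple(col) for col in f["schema"])
        if key not in schema_ids:
            schema_ids[key] = len(schemas)
            schemas.append([list(col) for col in key])
        # A column holds a single value within a file when min equals max.
        single = sorted(c for c, (lo, hi) in f["min_max"].items() if lo == hi)
        entries.append({
            "path": f["path"],
            "schema_id": schema_ids[key],   # reference instead of a repeated schema
            "row_count": f["row_count"],
            "single_value_columns": single,
        })
        total_rows += f["row_count"]
    return {"schemas": schemas, "row_count": total_rows, "files": entries}


# Made-up sample data for two files that share one schema.
schema = [("date", "DATE"), ("v", "INT")]
sample = [
    {"path": "table/20150701/a.parquet", "schema": schema, "row_count": 100,
     "min_max": {"date": ("2015-07-01", "2015-07-01"), "v": (1, 9)}},
    {"path": "table/20150702/b.parquet", "schema": schema, "row_count": 50,
     "min_max": {"date": ("2015-07-02", "2015-07-02"), "v": (3, 3)}},
]
summary = summarize(sample)
```

With this layout the schema appears once however many files share it, and a
planner could read `row_count` and `single_value_columns` directly instead of
recomputing them for every query.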

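Steven's proposal above, probing the filesystem and handing out work
assignments while the probe is still running, can be sketched with Python
generators. This is only an illustration of the streaming idea: `probe`,
`assign`, the worker names and the round-robin policy are hypothetical and
say nothing about how Drill fragments or Scanners are actually implemented.

```python
import os
import tempfile


def probe(root, suffix=".parquet"):
    """Lazily yield matching file paths while the directory walk is still running."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place makes the traversal order deterministic
        for name in sorted(filenames):
            if name.endswith(suffix):
                yield os.path.join(dirpath, name)


def assign(paths, workers):
    """Hand each discovered file to a worker (round-robin) as soon as it is found."""
    for i, path in enumerate(paths):
        yield workers[i % len(workers)], path


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    for rel in ("20150701/a.parquet", "20150701/b.parquet",
                "20150702/c.parquet", "20150702/notes.txt"):
        full = os.path.join(root, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()
    pairs = [(worker, os.path.relpath(path, root))
             for worker, path in assign(probe(root), ["w0", "w1"])]
```

Because both functions are generators, the first assignment is available as
soon as the first file is found, so a consumer can start reading before the
whole tree has been listed.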