drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Parth Chandra <pchan...@maprtech.com>
Subject Memory plannig in Drill
Date Thu, 09 Oct 2014 06:03:43 GMT
Hi everyone,

Aman, Jinfeng and I had an initial offline discussion about memory planning
in Drill. I’m summarizing the discussion below and hoping to initiate a
discussion around some of these ideas.


Order of Execution
-------------------------
Assertion: For memory management Order of Execution is a fundamental issue.

One of the problems with memory usage in the execution engine is that all
operators start up simultaneously and start allocating memory even though
the downstream operators may not be ready to consume their output.

For example, in the plan below :

      Agg
       |
      HJ2
     /    \
    HJ1    Scan3
   /   \
Scan1   Scan2


the scan operators all begin reading data simultaneously. In this case, the
Hash Joins are blocking operations and the output of Scan3 cannot be
consumed until HJ1 is ready to emit its results. If, say, Scan2 is on the
build side of the hash table for HJ1, then HJ1 will not emit any records
until Scan2 completes its operation. If Scan3 starts immediately, it is
consuming memory that could be utilized by Scan2. Instead, if we delay the
start of Scan3 until after HJ1 is ready to emit records, we can utilize
memory more efficiently.

To address this, we can think of the query plan in terms of pipeline
chains, where a pipeline chain is a chain of operators terminated by a
blocking operator.

In the example, there would be three pipeline chains :
PC1 : Scan1-HJ1-HJ2-Agg
PC2 : Scan 2
PC3 : Scan 3

Now, we can see that we can start PC1 and PC2, but PC3 can be delayed until
PC2 is completed and PC1 has reached HJ2.

One thing we need to consider is that multiple major fragments can be part
of a single pipeline chain. All these major fragments can begin execution
if the pipeline chain is ready to begin execution.

We need to think this one through, though. There are probably many details
to be hashed out, though one thing is certain: the planner has to provide
some more information to the execution engine in terms of the ordering of
the pipeline chains. In other words, implementing this needs work on both
the planner and the execution engine. We also need to work out the details
of how the idea of a pipeline chain will be reconciled with the idea of
major/minor fragments which are currently the units of execution.

Fragment memory limit
—----------------------------
   We have implemented a simple method to limit the use of memory by a
single fragment in 0.6 (it is disabled by default). This prevents a single
fragment from hogging too much memory while other fragments may be starved.
However the current implementation has some drawbacks:
  i) The fragment memory allocators have no idea of how much memory is
really needed by the fragment. The fragment limit is therefore determined
by dividing the available memory *equally* among the fragments. This is not
a fair method; a better choice would be to allocate fragment limits based
on the relative needs of the executing fragments.
  ii) The idea of limiting memory use by a fragment is a little too narrow,
since the purpose is to allow many queries, not fragments, to be able to
run together. The current mechanism favours queries that may have many
fragments over queries with fewer fragments which may have equivalent
memory needs.

   To address this, we need to assign memory limits per query instead of
per fragment. In addition, we have some estimates at the query level for
the amount of memory that the query may need. We should probably change the
memory limit implementation to use this estimate and assign limits
proportionately. In addition, the limit should be calculated per query (and
assigned to all fragments of the same query). It might be even better if we
could estimate the memory requirement per fragment and use that as the
limit.

    Again, some work needs to be done to figure out what data is available
that the allocators can use and what data can be calculated/estimated at
the planning stage to allow the allocators to distribute memory fairly.


    All critiques and criticisms are welcome, we’re hoping to get a good
discussion going around this.


Parth

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message