drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Timothy Chen <tnac...@gmail.com>
Subject Re: Memory plannig in Drill
Date Thu, 09 Oct 2014 07:37:38 GMT
Hi Parth,

Thanks for providing an update, this is really great to see more
design discussions on the list!

The pipeline chains definitely makes lot of sense, and I still
remember discussions offline around this in the past.

The global memory efficency seems like a scheduling problem, as the
delay of a chain only benefits if there are other chains in-flight,
and be at a chain level or at a query level.

I don't have much to add yet, but love to see how we can start simple on this.

One paper that is relevant that I just started to read is from the
recent OSDI 14, will chime in more once I grasp it.
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/boutin

Tim


On Wed, Oct 8, 2014 at 11:03 PM, Parth Chandra <pchandra@maprtech.com> wrote:
> Hi everyone,
>
> Aman, Jinfeng and I had an initial offline discussion about memory planning
> in Drill. I’m summarizing the discussion below and hoping to initiate a
> discussion around some of these ideas.
>
>
> Order of Execution
> -------------------------
> Assertion: For memory management Order of Execution is a fundamental issue.
>
> One of the problems with memory usage in the execution engine is that all
> operators start up simultaneously and start allocating memory even though
> the downstream operators may not be ready to consume their output.
>
> For example, in the plan below :
>
>       Agg
>        |
>       HJ2
>      /    \
>     HJ1    Scan3
>    /   \
> Scan1   Scan2
>
>
> the scan operators all begin reading data simultaneously. In this case, the
> Hash Joins are blocking operations and the output of Scan3 cannot be
> consumed until HJ1 is ready to emit its results. If, say, Scan2 is on the
> build side of the hash table for HJ1, then HJ1 will not emit any records
> until Scan2 completes its operation. If Scan3 starts immediately, it is
> consuming memory that could be utilized by Scan2. Instead, if we delay the
> start of Scan3 until after HJ1 is ready to emit records, we can utilize
> memory more efficiently.
>
> To address this, we can think of the query plan in terms of pipeline
> chains, where a pipeline chain is a chain of operators terminated by a
> blocking operator.
>
> In the example, there would be three pipeline chains :
> PC1 : Scan1-HJ1-HJ2-Agg
> PC2 : Scan 2
> PC3 : Scan 3
>
> Now, we can see that we can start PC1 and PC2, but PC3 can be delayed until
> PC2 is completed and PC1 has reached HJ2.
>
> One thing we need to consider is that multiple major fragments can be part
> of a single pipeline chain. All these major fragments can begin execution
> if the pipeline chain is ready to begin execution.
>
> We need to think this one through, though. There are probably many details
> to be hashed out, though one thing is certain: the planner has to provide
> some more information to the execution engine in terms of the ordering of
> the pipeline chains. In other words, implementing this needs work on both
> the planner and the execution engine. We also need to work out the details
> of how the idea of a pipeline chain will be reconciled with the idea of
> major/minor fragments which are currently the units of execution.
>
> Fragment memory limit
> —----------------------------
>    We have implemented a simple method to limit the use of memory by a
> single fragment in 0.6 (it is disabled by default). This prevents a single
> fragment from hogging too much memory while other fragments may be starved.
> However the current implementation has some drawbacks:
>   i) The fragment memory allocators have no idea of how much memory is
> really needed by the fragment. The fragment limit is therefore determined
> by dividing the available memory *equally* among the fragments. This is not
> a fair method; a better choice would be to allocate fragment limits based
> on the relative needs of the executing fragments.
>   ii) The idea of limiting memory use by a fragment is a little too narrow,
> since the purpose is to allow many queries, not fragments, to be able to
> run together. The current mechanism favours queries that may have many
> fragments over queries with fewer fragments which may have equivalent
> memory needs.
>
>    To address this, we need to assign memory limits per query instead of
> per fragment. In addition, we have some estimates at the query level for
> the amount of memory that the query may need. We should probably change the
> memory limit implementation to use this estimate and assign limits
> proportionately. In addition, the limit should be calculated per query (and
> assigned to all fragments of the same query). It might be even better if we
> could estimate the memory requirement per fragment and use that as the
> limit.
>
>     Again, some work needs to be done to figure out what data is available
> that the allocators can use and what data can be calculated/estimated at
> the planning stage to allow the allocators to distribute memory fairly.
>
>
>     All critiques and criticisms are welcome, we’re hoping to get a good
> discussion going around this.
>
>
> Parth

Mime
View raw message