drill-dev mailing list archives
From Boaz Ben-Zvi <bben-...@mapr.com>
Subject Re: Drill: Memory Spilling for the Hash Aggregate Operator
Date Tue, 17 Jan 2017 02:55:07 GMT
The design in the document has an undocumented assumption — the memory available to
an operator may fluctuate during the operator’s lifetime.
For example, the Hash Aggregate operator (HAG) may hit the memory limit at 500MB, then later
hit it again at 600MB, etc., and when reading a spilled partition back from disk the limit
may be 400MB, etc.
(A gentler scheme may allow memory only to increase, never to decrease; e.g., other operators may
“give back” some of their allocations during execution —
e.g., HHJ, after finishing the build phase, can yield the extra memory to another operator.)
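To make the idea concrete, here is an illustrative Python sketch (names and numbers are made up, not Drill code) of an operator that re-reads its budget before every batch, so the limit may change mid-run:

```python
# Sketch (hypothetical names): an operator whose memory budget can
# fluctuate during its lifetime; it spills whenever the *current*
# limit would be exceeded.

class MemoryBudget:
    """Stands in for an allocator whose limit can change at any time."""
    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def try_reserve(self, nbytes):
        if self.used_bytes + nbytes > self.limit_bytes:
            return False
        self.used_bytes += nbytes
        return True

def process_batches(batch_sizes, budget, spill):
    """Aggregate incoming batches; spill when the current limit is hit."""
    for size in batch_sizes:
        if not budget.try_reserve(size):
            spill(budget.used_bytes)   # write in-memory partitions to disk
            budget.used_bytes = 0
            budget.try_reserve(size)

spilled = []
budget = MemoryBudget(500)
process_batches([200, 200, 200], budget, spilled.append)  # hits 500MB limit
budget.limit_bytes = 600                                  # limit fluctuates upward
process_batches([300, 300, 300], budget, spilled.append)  # hits 600MB limit
```

The point of the sketch is only that the spill decision consults the limit at the moment of allocation, rather than a value fixed at operator start.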

This is a more sophisticated memory management design than what we have now — a simple fixed
memory limit per operator.
A second point is Hash Aggregation — which is a little different from HHJ. For example,
spilling a partition X (over multiple spill iterations) may end up with 500MB on disk, while only
(a fixed) 400MB of memory is available.
Does this mean that partition X would not fit into memory? No, as X may contain many
duplicate groups (groups were re-created after every spill), so as we read and aggregate X,
its size may “shrink” enough to fit in the 400MB.
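A small Python sketch of the shrinkage effect (hypothetical row width and counts): the same group keys get re-created in each spill file, and they collapse again when the partition is read back and aggregated:

```python
# Sketch: a spilled partition can be larger on disk than the memory
# needed to re-aggregate it, because duplicate group keys were
# re-created after each spill and merge again on the way back in.

from collections import defaultdict

ROW_BYTES = 100  # hypothetical fixed row width

# Partition X as written to disk over several spill iterations:
# 5 spill files, each re-creating the same 100 group keys.
spill_files = [[("key%d" % k, 1) for k in range(100)] for _ in range(5)]

on_disk_bytes = sum(len(f) for f in spill_files) * ROW_BYTES

# Reading X back and aggregating: duplicate keys across files merge.
groups = defaultdict(int)
for f in spill_files:
    for key, count in f:
        groups[key] += count

in_memory_bytes = len(groups) * ROW_BYTES
```

Here 50000 bytes on disk collapse to 10000 bytes of live groups — the same way a 500MB partition may fit a 400MB budget.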

Indeed, the design was not trying to avoid “re-writing to disk”; the assumption, however,
was that this would be a rare situation, and may apply to only part of the data.
If the initial number of partitions N is chosen large enough, then spilling would happen only
once for (some of) the data.

For HAG, the partitions should all be of similar sizes, so if N was chosen too small, they
would all need a “secondary spilling”, which would include re-writing all the data (that’s
analogous to “phase 3” ….)
HHJ is a little different (due to possible duplicate-key rows) — some partitions may be
bigger than others, so maybe only a few HHJ partitions would be written twice.
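A toy Python sketch of the HAG case (all numbers hypothetical — a stand-in hash, a made-up per-partition budget): with distinct group keys the partitions come out equal, so when N is too small, every one of them overflows and all the data needs secondary spilling:

```python
# Sketch: distinct group keys spread evenly over N partitions, so an
# undersized N means *every* partition overflows the memory budget
# and all data must be re-partitioned ("secondary spilling").

N = 4
MEMORY_PER_PARTITION = 300   # hypothetical budget, in rows

# 1600 distinct group keys; a toy "hash value" stands in for the
# real hash of each key.
key_hashes = range(1600)

sizes = [0] * N
for h in key_hashes:
    sizes[h % N] += 1        # assign each key to a partition

# ~400 rows per partition vs. a 300-row budget: all overflow.
needs_secondary = [s > MEMORY_PER_PARTITION for s in sizes]
```

With duplicate-key skew (the HHJ case) the sizes would diverge, and only the oversized partitions would need the second write.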

The downside of the “sub-partition” model described in Goetz’ paper is that while a
second write of all the data is indeed saved, all the data (mixed together) needs to be read
(up to) n times (plus some CPU overhead to filter out the needed parts).
Reads may be costly, due to seeks (e.g., when a partition was spilled over many iterations and
ended up scattered across the disk).
For the (future) HHJ design — we could handle a “too big” partition using a “hash
loop”; that is, read only a part of the inner partition, then scan the whole matching outer
partition, then read the next inner part, and scan the whole outer partition again.
This is indeed costly, but not much different from the “sub-partitioning” scheme — we
just read a part of the inner based on size (instead of several sub-partitions, like 0.0, 0.1, and 0.2),
and then the whole outer without filtering — the hash-table probing would do that.
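The hash loop can be sketched in a few lines of Python (illustrative only — function and variable names are made up, not the actual HHJ design):

```python
# Sketch of a "hash loop": the inner partition is too big for memory,
# so build a hash table on one memory-sized slice of it at a time,
# and rescan the entire outer partition once per slice.

def hash_loop_join(inner, outer, memory_rows):
    """inner/outer: lists of (key, payload). Returns matched triples."""
    results = []
    for start in range(0, len(inner), memory_rows):
        # Build phase: only a slice of the inner fits in memory.
        table = {}
        for key, val in inner[start:start + memory_rows]:
            table.setdefault(key, []).append(val)
        # Probe phase: the whole outer is scanned for each slice;
        # hash-table probing does the filtering, no pre-filter needed.
        for key, val in outer:
            for match in table.get(key, []):
                results.append((key, match, val))
    return results

inner = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
outer = [("a", 10), ("c", 30), ("e", 50)]
out = hash_loop_join(inner, outer, memory_rows=2)
```

Note the cost structure matches the text: the outer side is read once per inner slice, which is the price of not sub-partitioning the inner in advance.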

     Thanks for the suggestions and the link; I’ll go over Goetz’ paper again and look
for more ideas.

          — Boaz

> On Jan 16, 2017, at 4:09 PM, Julian Hyde <jhyde@apache.org> wrote:
> Does the data need to be written into a disk-friendly format when a partition is selected
to be written to disk? If you are careful in your choice of format then it doesn’t need
to be re-written. And in fact you can start with the assumption that everything is going to
> One of the most beautiful innovations of the HHJ paper [1] is the histogram-driven partitioning.
Basically, during phase 1 you apply the phase 2 hash function to assign rows to “sub-partitions”.
Partition 0 would contain sub-partitions 0.0, 0.1, … 0.n; partition 1 would contain sub-partitions
1.1, …, 1.n. The rows are all mixed together in each partition, but you know how many rows
(and bytes) are in each sub-partition. If partition 0 (or any partition) ends up larger than
memory then you are going to need a phase 3. But you can enter phase 2 armed with some very
useful knowledge. You know the sizes of the sub-partitions and you can choose a hash function
in phase 2 such that many of the partitions end up *just* smaller than memory.
> The big problem with external sort and hash algorithms is the huge performance hit when
you require an extra phase. If you need 2 phases, HHJ can convert that to 1.5 phases (by pulling
smaller partitions back into memory) and by optimizing the assignment of rows to partitions
it can turn a 3.1 phase query into a 2.9 phase query - a big win.
> Julian
> [1] https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf <https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf>
>> On Jan 14, 2017, at 7:34 AM, Boaz Ben-Zvi <bben-zvi@mapr.com> wrote:
>> Sorry for no attachment (Apache mail rules) -- Here is a link to the document:
>> DrillSpillmemoryforHashAggregation.pdf - https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing
>>   -- Boaz
>> ________________________________
>> From: Julian Hyde <jhyde@apache.org>
>> Sent: Friday, January 13, 2017 11:00 PM
>> To: dev@drill.apache.org
>> Subject: Re: Drill: Memory Spilling for the Hash Aggregate Operator
>> The attachment didn't come through. I'm hoping that you settled on a "hybrid" hash
algorithm that can write to disk, or write to memory, and the cost of discovering that is
wrong is not too great. With Goetz Graefe's hybrid hash join (which can be easily adapted
to hybrid hash aggregate) if the input ALMOST fits in memory you could process most of it
in memory, then revisit the stuff you spilled to disk.
>>> On Jan 13, 2017, at 7:46 PM, Boaz Ben-Zvi <bben-zvi@mapr.com> wrote:
>>> Hi Drill developers,
>>>    Attached is a document describing the design for memory spilling implementation
for the Hash Aggregate operator.
>>>    Please send me any comments or questions,
>>>       -- Boaz
