drill-dev mailing list archives

From Aman Sinha <amansi...@apache.org>
Subject Re: Performance issue with 2 phase hash-agg design
Date Tue, 20 Jun 2017 22:26:50 GMT
See [1], which describes this behavior for unique keys and suggests
manually forcing the single-phase aggregation.
We would need NDV statistics on the group-by keys to have the optimizer
pick the more efficient scheme.

[1] https://drill.apache.org/docs/guidelines-for-optimizing-aggregation/
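The runtime-detection idea quoted below could be sketched roughly as follows. This is only an illustration of the approach, not Drill code: the class name, sample size, and threshold are all hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: sample the first SAMPLE_SIZE group-by keys; if the
// observed unique-key ratio exceeds THRESHOLD, stop aggregating in phase 1
// and pass rows straight through to the second-phase aggregation.
public class UniqueKeyDetector {
    private static final int SAMPLE_SIZE = 10_000;   // illustrative
    private static final double THRESHOLD = 0.9;     // illustrative

    private final Set<Object> seen = new HashSet<>();
    private long processed = 0;
    private boolean passThrough = false;

    /** Returns true once first-phase aggregation should be skipped. */
    public boolean observe(Object groupKey) {
        if (passThrough) {
            return true;
        }
        processed++;
        seen.add(groupKey);
        if (processed >= SAMPLE_SIZE) {
            double uniqueRatio = (double) seen.size() / processed;
            passThrough = uniqueRatio > THRESHOLD;
            seen.clear();  // the sample is no longer needed
        }
        return passThrough;
    }

    public static void main(String[] args) {
        UniqueKeyDetector d = new UniqueKeyDetector();
        boolean skip = false;
        for (int i = 0; i < 20_000; i++) {
            skip = d.observe(i);  // all-unique keys trip the detector
        }
        System.out.println(skip);  // true: phase 1 should pass through
    }
}
```

The sampling cost is bounded by SAMPLE_SIZE hash-set entries, so the check itself stays cheap even when the detector ultimately decides to keep aggregating.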

On Tue, Jun 20, 2017 at 2:30 PM, Chun Chang <cchang@mapr.com> wrote:

> I also noticed if the keys are mostly unique, the first phase aggregation
> effort is mostly wasted. This can and should be improved.
> One idea is to detect unique keys while processing. When the percentage of
> unique keys exceeds a certain threshold after processing a certain percentage
> of the data, skip aggregating the rest and send it directly downstream to the
> second-phase aggregation.
> ________________________________
> From: rahul challapalli <challapallirahul@gmail.com>
> Sent: Tuesday, June 20, 2017 1:36:31 PM
> To: dev
> Subject: Performance issue with 2 phase hash-agg design
> During the first phase, the hash-agg operator is not protected from skew in
> the data (e.g., the data consists of 2 files where the number of records in
> one file is very large compared to the other). Assuming there are only 2
> fragments, the hash-agg operator in one fragment handles more records and
> aggregates until the memory available to it is exhausted, at which point it
> sends the record batches downstream to the hash-partitioner.
> Because the hash-partitioner normalizes the skew in the data, the work is
> evenly divided and the 2 minor fragments running the second-phase
> hash-aggregate take a similar amount of processing time.
> So what is the problem here? During the first phase, one minor fragment
> takes a long time, which affects the runtime of the query. Instead, if the
> first phase did not do any aggregation, or only used a small amount of memory
> (thereby limiting the aggregations performed), then the query would have
> completed faster. The advantage of doing 2-phase aggregation is reduced
> traffic on the network, but if the keys used in the group by are mostly
> unique, then we lose this advantage as well.
> I was playing with the new spillable hash-agg code and observed that
> increasing memory did not improve the runtime.  This behavior can be
> explained by the above reasoning.
> Aggregating on mostly unique keys may not be a common use case, but any
> thoughts in general about this?
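The cost argument in the quoted message can be made concrete: with R input rows and K distinct group-by keys, the first phase can shrink the stream sent over the network by at most (R - K)/R. A quick back-of-the-envelope check (class name and numbers are purely illustrative):

```java
public class PhaseOneReduction {
    // Fraction of rows eliminated by first-phase aggregation: with R input
    // rows and K distinct group-by keys, phase 1 emits at most K rows, so
    // the best-case reduction is (R - K) / R.
    static double reduction(long rows, long distinctKeys) {
        return (rows - (double) Math.min(rows, distinctKeys)) / rows;
    }

    public static void main(String[] args) {
        // Low-cardinality keys: phase 1 pays off handsomely.
        System.out.println(reduction(1_000_000, 1_000));   // 0.999
        // Mostly unique keys: phase 1 forwards almost as many rows as it
        // received, so its CPU and memory are spent for little savings.
        System.out.println(reduction(1_000_000, 990_000)); // 0.01
    }
}
```

When the keys are mostly unique the reduction approaches zero, which is why increasing the memory given to the first phase cannot improve the runtime in that case.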
