drill-user mailing list archives

From Paul Rogers <par0...@gmail.com>
Subject Re: Aggregate UDF and HashAgg
Date Mon, 27 Jul 2020 17:36:37 GMT
Hi James,

The behavior you see can mostly be explained by the way the two
aggregates work. The streaming agg is a sequential operator: it works with
sorted data, starts one aggregate, gathers all the data for that group,
then resets for the next. The hash agg is a parallel aggregate: it starts
all aggregates at the same time, adds data to each of them, selected by
hash key, as rows arrive, and completes all aggregates together at the
end. No reset is needed in a parallel agg.
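
As a rough illustration of the two call patterns, here is a toy model in
plain Java. It is not Drill code: LoggingSum, Row, streamingAgg and hashAgg
are hypothetical names made up for this sketch, and it assumes reset() is
called at the start of each group and setup() once per accumulator. It does
not try to reproduce how many times the real HashAgg calls setup().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these classes are NOT Drill's operators or UDF interfaces.
final class AggLifecycleSketch {

    /** Hypothetical stand-in for a SUM aggregate UDF that logs each call. */
    static final class LoggingSum {
        private final List<String> log;
        private long sum;

        LoggingSum(List<String> log) { this.log = log; }

        void setup() { log.add("setup"); }
        void reset() { sum = 0; log.add("reset"); }
        void add(long value) { sum += value; log.add("add"); }
        long output() { log.add("output"); return sum; }
    }

    /** A (group key, value) input row. */
    static final class Row {
        final String key;
        final long value;
        Row(String key, long value) { this.key = key; this.value = value; }
    }

    /** Sequential style: one accumulator, reused group after group. Input must be sorted by key. */
    static Map<String, Long> streamingAgg(List<Row> sortedRows, List<String> log) {
        Map<String, Long> results = new LinkedHashMap<>();
        LoggingSum acc = new LoggingSum(log);
        acc.setup();
        String currentKey = null;
        for (Row row : sortedRows) {
            if (!row.key.equals(currentKey)) {
                if (currentKey != null) {
                    results.put(currentKey, acc.output()); // finish the previous group
                }
                acc.reset();                               // assumed: reset before each group
                currentKey = row.key;
            }
            acc.add(row.value);
        }
        if (currentKey != null) {
            results.put(currentKey, acc.output());         // finish the last group
        }
        return results;
    }

    /** Parallel style: one accumulator per key, all alive until the input ends. No reset. */
    static Map<String, Long> hashAgg(List<Row> rows, List<String> log) {
        Map<String, LoggingSum> groups = new LinkedHashMap<>();
        for (Row row : rows) {
            LoggingSum acc = groups.get(row.key);
            if (acc == null) {                             // first row for this key
                acc = new LoggingSum(log);
                acc.setup();
                groups.put(row.key, acc);
            }
            acc.add(row.value);
        }
        Map<String, Long> results = new LinkedHashMap<>();
        for (Map.Entry<String, LoggingSum> e : groups.entrySet()) {
            results.put(e.getKey(), e.getValue().output()); // all groups complete at the end
        }
        return results;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("a", 1), new Row("b", 2), new Row("a", 3), new Row("c", 4));

        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparing((Row r) -> r.key));

        List<String> streamingLog = new ArrayList<>();
        System.out.println("streaming: " + streamingAgg(sorted, streamingLog) + " " + streamingLog);

        List<String> hashLog = new ArrayList<>();
        System.out.println("hash:      " + hashAgg(rows, hashLog) + " " + hashLog);
    }
}
```

In this toy, the streaming log contains one reset per group, while the hash
log contains none, because each key gets its own fresh accumulator.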

The real question is whether the parallel (hash) agg correctly calls the
add method multiple times and the output method once for each of the
parallel aggregates.

You are seeing the key trade-off between the two implementations: the
sequential (streaming) agg is very memory frugal, but requires a sort to
organize data. The parallel (hash) agg requires no sort, at the cost of
more memory, since it must hold all active groups at once. Classic DB stuff.
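
To make the trade-off concrete, here is a small sketch of a COUNT grouped
by key, again in plain Java with made-up names (AggTradeOffSketch,
streamingCount, hashCount), not Drill's implementation. The streaming
version keeps a single counter and so depends on sorted input; the hash
version accepts any order but keeps one counter per distinct key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy COUNT(*) ... GROUP BY key; hypothetical names, not Drill code.
final class AggTradeOffSketch {

    /** Streaming style: one counter of state; emits a group whenever the key changes. */
    static List<String> streamingCount(List<String> keys) {
        List<String> emitted = new ArrayList<>();
        String current = null;
        long count = 0; // the only per-group state held at any time
        for (String key : keys) {
            if (current != null && !key.equals(current)) {
                emitted.add(current + "=" + count);
                count = 0;
            }
            current = key;
            count++;
        }
        if (current != null) {
            emitted.add(current + "=" + count);
        }
        return emitted;
    }

    /** Hash style: one counter per distinct key, all held until the input ends. */
    static Map<String, Long> hashCount(List<String> keys) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String key : keys) {
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> unsorted = Arrays.asList("a", "b", "a", "c", "b");
        List<String> sorted = new ArrayList<>(unsorted);
        Collections.sort(sorted);

        System.out.println("streaming, unsorted: " + streamingCount(unsorted));
        System.out.println("streaming, sorted:   " + streamingCount(sorted));
        System.out.println("hash, unsorted:      " + hashCount(unsorted));
    }
}
```

On unsorted input the streaming version emits the same key more than once
(fragmented groups), which is why the sort is needed; the hash version gets
the right counts without sorting, but its map grows with the number of
distinct keys.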

Thanks,

- Paul


On Sun, Jul 26, 2020 at 7:56 AM James Turton <james@somecomputer.xyz> wrote:

> Hi all
>
> I'm writing an aggregate UDF with help from the notes here:
>
> https://github.com/paul-rogers/drill/wiki/Aggregate-UDFs
>
> I'm printing a line to stderr from each of the UDF methods so I can
> keep an eye on the call sequence.  When my UDF is invoked by a
> StreamingAgg operator the lifecycle of method calls - setup(), reset(),
> add(), output() - is as described in the wiki.  When my UDF is invoked
> by a HashAgg operator things change dramatically.  The setup() method is
> called some hundreds of times and reset() is never called even though I
> have three groups in the query's "group by"!  Anyone know what could be
> happening here?
>
> Thanks
> James
>
> --
> PGP public key <http://somecomputer.xyz/james.asc>
>
