Hi James,

The behavior you see can mostly be explained by the way the two aggregates work.

The streaming agg is a sequential operator: it works with sorted data, starts one aggregate, gathers all the data for that group, then resets for the next group.

The hash agg is a parallel aggregate: it runs all aggregates in parallel. It starts all aggregates at the same time, adds data to each of them, selected by hash key, as rows arrive, and completes all aggregates at the same time at the end. No reset is needed in a parallel agg.

The real question is whether the parallel (hash) agg correctly calls add() multiple times and output() once for each of the parallel aggregates.

You are seeing the key trade-off between the two implementations: the sequential (streaming) agg is very memory frugal, but requires a sort to organize the data. The parallel (hash) agg requires no sort, at the cost of more memory to hold all active groups at once. Classic DB stuff.

Thanks,
- Paul

On Sun, Jul 26, 2020 at 7:56 AM James Turton wrote:

> Hi all
>
> I'm writing an aggregate UDF with help from the notes here:
>
> https://github.com/paul-rogers/drill/wiki/Aggregate-UDFs
>
> I'm printing a line to stderr from each of the UDF methods so I can
> keep an eye on the call sequence. When my UDF is invoked by a
> StreamingAgg operator the lifecycle of method calls - setup(), reset(),
> add(), output() - is as described in the wiki. When my UDF is invoked
> by a HashAgg operator things change dramatically. The setup() method is
> called some hundreds of times and reset() is never called even though I
> have three groups in the query's "group by"! Anyone know what could be
> happening here?
>
> Thanks
> James
>
> --
> PGP public key
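
P.S. To make the two call sequences described in the reply concrete, here is a toy Java sketch. It is not Drill code: `AggCallSequence`, `LoggingSum`, `streaming()` and `hash()` are hypothetical names made up for illustration, and the model assumes one aggregate instance per group in the hash case, which may not match how Drill actually allocates them (the real operator may call setup() far more often).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical classes, not Drill's StreamingAgg or HashAgg.
class AggCallSequence {

    // A SUM aggregate that records each lifecycle call in a shared log.
    static class LoggingSum {
        private final List<String> log;
        private long sum;

        LoggingSum(List<String> log) { this.log = log; }

        void setup()     { sum = 0;  log.add("setup"); }
        void reset()     { sum = 0;  log.add("reset"); }
        void add(long v) { sum += v; log.add("add"); }
        long output()    { log.add("output"); return sum; }
    }

    // Sequential: one aggregate, input sorted by key, reset between groups.
    static Map<String, Long> streaming(String[] keys, long[] values, List<String> log) {
        Map<String, Long> result = new LinkedHashMap<>();
        if (keys.length == 0) {
            return result;
        }
        LoggingSum agg = new LoggingSum(log);
        agg.setup();
        String current = keys[0];
        for (int i = 0; i < keys.length; i++) {
            if (!keys[i].equals(current)) {
                // Group boundary: emit the finished group, then reuse the aggregate.
                result.put(current, agg.output());
                agg.reset();
                current = keys[i];
            }
            agg.add(values[i]);
        }
        result.put(current, agg.output());
        return result;
    }

    // Parallel: one aggregate per key, input in any order, no reset.
    static Map<String, Long> hash(String[] keys, long[] values, List<String> log) {
        Map<String, LoggingSum> groups = new LinkedHashMap<>();
        for (int i = 0; i < keys.length; i++) {
            LoggingSum agg = groups.get(keys[i]);
            if (agg == null) {
                // First row for this key: start a new aggregate (assumed one per group).
                agg = new LoggingSum(log);
                agg.setup();
                groups.put(keys[i], agg);
            }
            agg.add(values[i]);
        }
        // All groups stay in memory until the input ends, then all are emitted.
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, LoggingSum> e : groups.entrySet()) {
            result.put(e.getKey(), e.getValue().output());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> streamLog = new ArrayList<>();
        streaming(new String[] {"a", "a", "b", "c", "c"}, new long[] {1, 2, 3, 4, 5}, streamLog);
        System.out.println("streaming resets: " + Collections.frequency(streamLog, "reset"));

        List<String> hashLog = new ArrayList<>();
        hash(new String[] {"c", "a", "b", "a", "c"}, new long[] {4, 1, 3, 2, 5}, hashLog);
        System.out.println("hash resets: " + Collections.frequency(hashLog, "reset"));
    }
}
```

With three groups, the sequential version logs one setup and two resets, while the parallel version logs three setups and no reset at all. Both log five adds and three outputs, which is the check suggested in the reply: add() many times and output() once per group.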