Ted Dunning wrote:
> That is a fine answer for some things, but the parallel cases fail.
>
> My feeling is that there are a few cases where there are nice aggregatable
> summary statistics like moments and there are many cases where this just
> doesn't work well (such as rank statistics).
Yes, this is why not all statistics are "storeless." We have another
"summary" class, DescriptiveStatistics, that keeps its data in storage
and supports "rolling" behavior. The discussion here is focused on the
"storeless" case, which is limited to those stats that can be computed
without storing the data. The cases of interest are stats that can be
computed in one pass through the data but which can't be "aggregated"
post hoc. John's approach provides a simple solution to this problem.
For completeness, we should probably also implement aggregation in the
sense defined in MATH-224 for DescriptiveStatistics.
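
To make the idea concrete, here is a minimal sketch of updating the
aggregate at the same time as the per-partition stats. The class names
(AggregateSketch, RunningStats, ContributingStats) are hypothetical and
are not the Commons Math API; the one-pass mean/variance update is the
usual Welford-style recurrence, used here only for illustration.

```java
/** Illustration only: hypothetical names, not Commons Math classes. */
class AggregateSketch {

    /** One-pass ("storeless") accumulator for n, mean, variance, min, max. */
    static class RunningStats {
        private long n;
        private double mean;
        private double m2; // running sum of squared deviations from the mean
        private double min = Double.POSITIVE_INFINITY;
        private double max = Double.NEGATIVE_INFINITY;

        void addValue(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
            min = Math.min(min, x);
            max = Math.max(max, x);
        }

        long getN() { return n; }
        double getMean() { return mean; }
        double getVariance() { return n > 1 ? m2 / (n - 1) : Double.NaN; }
        double getMin() { return min; }
        double getMax() { return max; }
    }

    /** Per-partition stats that also forward every value to a shared aggregate. */
    static class ContributingStats extends RunningStats {
        private final RunningStats aggregate;

        ContributingStats(RunningStats aggregate) {
            this.aggregate = aggregate;
        }

        @Override
        void addValue(double x) {
            super.addValue(x);
            // Lock so that partitions fed from different threads can share one aggregate.
            synchronized (aggregate) {
                aggregate.addValue(x);
            }
        }
    }

    public static void main(String[] args) {
        RunningStats total = new RunningStats();
        RunningStats a = new ContributingStats(total);
        RunningStats b = new ContributingStats(total);
        for (double x : new double[] {1, 2, 3}) {
            a.addValue(x);
        }
        for (double x : new double[] {4, 5}) {
            b.addValue(x);
        }
        System.out.println("a mean = " + a.getMean());
        System.out.println("b mean = " + b.getMean());
        System.out.println("total mean = " + total.getMean());
    }
}
```

Nothing is merged after the fact here: the aggregate simply sees every
value once, so it works for any stat that can be computed in one pass.
The price is a shared, locked object on every addValue call.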
Phil
> For the latter case, I usually
> make do with a surrogate such as a random subsample or a recency weighted
> random subsample combined with a few aggregatable stats such as total
> samples, max, min, sum and second moment. That gives me most of what I want
> and if the subsample is reasonably large, I can sometimes estimate a few
> parameters such as total uniques. The subsampled data streams can be
> combined trivially so I now have an aggregatable approximation of
> nonaggregatable statistics. For descriptive quantiles this is generally
> just fine.
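
A rough sketch of the surrogate described above, assuming a plain
(not recency-weighted) reservoir sample kept alongside count, min, max,
sum and sum of squares. The class MergeableSample and all of its methods
are hypothetical, and the merge shown is just one way to combine two
subsamples so that the result is still a uniform sample of the union.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustration only: a hypothetical class, not part of Commons Math. */
class MergeableSample {
    private final int capacity;
    private final Random rng;
    private final List<Double> sample = new ArrayList<>();
    private long n;
    private double sum;
    private double sumSq;
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    MergeableSample(int capacity, Random rng) {
        this.capacity = capacity;
        this.rng = rng;
    }

    /** Updates the exact stats and the reservoir (classic "Algorithm R"). */
    void add(double x) {
        n++;
        sum += x;
        sumSq += x * x;
        min = Math.min(min, x);
        max = Math.max(max, x);
        if (sample.size() < capacity) {
            sample.add(x);
        } else {
            long j = (long) (rng.nextDouble() * n); // uniform in [0, n)
            if (j < capacity) {
                sample.set((int) j, x);
            }
        }
    }

    /** Combines two summaries; the exact stats add, the reservoirs are re-sampled. */
    static MergeableSample merge(MergeableSample a, MergeableSample b, Random rng) {
        if (a.capacity != b.capacity) {
            throw new IllegalArgumentException("capacities must match");
        }
        MergeableSample out = new MergeableSample(a.capacity, rng);
        out.n = a.n + b.n;
        out.sum = a.sum + b.sum;
        out.sumSq = a.sumSq + b.sumSq;
        out.min = Math.min(a.min, b.min);
        out.max = Math.max(a.max, b.max);
        List<Double> ra = new ArrayList<>(a.sample);
        List<Double> rb = new ArrayList<>(b.sample);
        long na = a.n; // values from stream a not yet accounted for
        long nb = b.n;
        while (out.sample.size() < out.capacity && na + nb > 0) {
            // Draw from a with probability proportional to its remaining stream size.
            boolean fromA = rng.nextDouble() * (na + nb) < na;
            List<Double> src = fromA ? ra : rb;
            out.sample.add(src.remove(rng.nextInt(src.size())));
            if (fromA) {
                na--;
            } else {
                nb--;
            }
        }
        return out;
    }

    /** Approximate quantile read off the sorted subsample; NaN if empty. */
    double quantile(double p) {
        if (sample.isEmpty()) {
            return Double.NaN;
        }
        List<Double> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        int idx = (int) Math.min(sorted.size() - 1, Math.floor(p * sorted.size()));
        return sorted.get(idx);
    }

    long getN() { return n; }
    double getMin() { return min; }
    double getMax() { return max; }
    double getMean() { return n > 0 ? sum / n : Double.NaN; }
    int sampleSize() { return sample.size(); }
}
```

The count, min, max and sums stay exact after a merge; only the
quantiles are approximate, with accuracy governed by the reservoir size.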
>
> On Sun, Apr 19, 2009 at 2:44 PM, John Bollinger <thinman42@yahoo.com> wrote:
>
>
>> The key would be to generate the aggregate statistics at the same time as
>> the per-partition ones, instead of aggregating them after the fact.
>>
