On 3 January 2014 22:58, Rainer Jung <rainer.jung@kippdata.de> wrote:
> On 03.01.2014 13:57, bugzilla@apache.org wrote:
>> https://issues.apache.org/bugzilla/show_bug.cgi?id=55932
>>
>>  Comment #6 from Sebb <sebb@apache.org> 
>> I have been having a look at the implementation.
>>
>> I don't really see that it needs Commons Math; we already have StatCalculator
>> which handles percentiles and more.
>>
>> Likewise, does it really need Commons Pool?
>> It seems wrong to need two separate pools of SocketOutputStream
>> instances.
>> How many of these would there be?
>>
>> Also, DescriptiveStatistics is not threadsafe (nor is StatCalculator).
>>
>> If we do implement something like this, I think the data processing needs
>> either to be carefully synchronised, or the raw data should be sent to a
>> separate singleton background thread.
>
> FWIW: I always get a bit nervous when percentiles are calculated.
> Percentiles are expensive to calculate if one needs exact results with
> given percentage numbers (50%, 99%, 99.9% etc.). In that case one needs
> to keep all values as an ordered list to calculate the percentiles. For
> a long running test that would be expensive in terms of memory but also
> in terms of CPU (sorting). There's no way of exactly merging percentiles
> from interim statistical data.
>
> Sometimes approximations are enough. By approximation I don't mean
> estimated data, but percentages which are not exactly the ones you
> asked for. E.g. you would get a 48% value instead of a 50% value, or a
> 99.02% value instead of a 99% value.
>
> Suppose you know (or can configure) that only very few samples will take
> longer than 1000ms, then one could create fixed bins for e.g. 10ms,
> 15ms, 20ms, 25ms, 30ms, 40ms, 50ms, 75ms, 100ms, 150ms, 200ms, 250ms,
> 300ms, 400ms, 500ms, 750ms and 1000ms. Now whenever a sample finishes
> you count the sample in the bin it belongs to and do not save the data
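> (of course you can still log it). At any time you can now look at the
> bin counts and see what percentage of samples finished within each
> bin boundary. This way one does not need to keep all sample values
> around and sort them, but one also does not get equidistant
> percentiles (10%, 11%, 12%, ...).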
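
To make the fixed-bin idea concrete, here is a minimal sketch in Java.
BinnedLatencyCounter and its methods are hypothetical names made up for
this mail, not existing JMeter code; the bin boundaries are the ones
from the example above. It uses AtomicLongArray so that sampler threads
can record without extra locking, which is only one possible answer to
the thread-safety question raised in the bug.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical fixed-bin counter for elapsed times; not part of JMeter. */
final class BinnedLatencyCounter {
    private final long[] upperBoundsMs;   // bin upper bounds, sorted ascending
    private final AtomicLongArray counts; // one slot per bin, last slot = overflow

    BinnedLatencyCounter(long[] upperBoundsMs) {
        this.upperBoundsMs = upperBoundsMs.clone();
        Arrays.sort(this.upperBoundsMs);
        this.counts = new AtomicLongArray(this.upperBoundsMs.length + 1);
    }

    /** Counts one sample in the first bin whose upper bound is >= elapsedMs. */
    void record(long elapsedMs) {
        int idx = Arrays.binarySearch(upperBoundsMs, elapsedMs);
        if (idx < 0) {
            idx = -idx - 1; // insertion point: first bound greater than the value
        }
        counts.incrementAndGet(idx); // idx == upperBoundsMs.length means overflow
    }

    /** Total number of samples recorded so far. */
    long total() {
        long sum = 0;
        for (int i = 0; i < counts.length(); i++) {
            sum += counts.get(i);
        }
        return sum;
    }

    /**
     * Percentage of samples with elapsed time <= upperBoundsMs[binIndex].
     * Under concurrent updates this is not an atomic snapshot, only a
     * close approximation.
     */
    double percentAtOrBelow(int binIndex) {
        long total = total();
        if (total == 0) {
            return 0.0;
        }
        long cumulative = 0;
        for (int i = 0; i <= binIndex; i++) {
            cumulative += counts.get(i);
        }
        return 100.0 * cumulative / total;
    }

    public static void main(String[] args) {
        long[] bounds = {10, 15, 20, 25, 30, 40, 50, 75, 100, 150, 200,
                         250, 300, 400, 500, 750, 1000};
        BinnedLatencyCounter counter = new BinnedLatencyCounter(bounds);
        for (long sample : new long[] {8, 12, 47, 120, 980, 2500}) {
            counter.record(sample);
        }
        for (int i = 0; i < bounds.length; i++) {
            System.out.printf("<= %4d ms: %6.2f%%%n",
                    bounds[i], counter.percentAtOrBelow(i));
        }
    }
}
```

Memory use is fixed by the number of bins, and counters from several
runs or engines can be merged by adding them bin by bin. The price is
the one described above: you only get the percentages that happen to
fall on the bin boundaries.
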
StatCalculator already takes a similar approach, counting values
rather than storing them.
We already use it for the GUI listeners.
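
As a rough illustration of counting values rather than storing them
(this is a sketch of the general idea only, not StatCalculator's actual
code; CountingPercentile is a hypothetical name, and I assume integer
millisecond values):

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical value-counting percentile calculator; not part of JMeter. */
final class CountingPercentile {
    // Distinct value -> number of times it was seen, kept in sorted order.
    private final TreeMap<Long, Long> valueCounts = new TreeMap<>();
    private long total;

    /** Records one sample; synchronized so several threads may call it. */
    synchronized void add(long value) {
        valueCounts.merge(value, 1L, Long::sum);
        total++;
    }

    /**
     * Returns the smallest recorded value v such that at least
     * fraction * total samples are <= v (fraction in 0.0 .. 1.0).
     */
    synchronized long percentile(double fraction) {
        if (total == 0) {
            throw new IllegalStateException("no samples recorded");
        }
        long target = Math.max(1L, (long) Math.ceil(fraction * total));
        long seen = 0;
        for (Map.Entry<Long, Long> entry : valueCounts.entrySet()) {
            seen += entry.getValue();
            if (seen >= target) {
                return entry.getKey();
            }
        }
        return valueCounts.lastKey(); // guards against rounding at fraction ~ 1.0
    }
}
```

Memory then grows with the number of distinct values, not with the
number of samples, and no sorting pass is needed because the map is
already ordered. The single lock is the simplest form of the "carefully
synchronised" option; it may well be too coarse for many sampler
threads.
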
There are other approaches; Commons Math DescriptiveStatistics uses an
array of doubles (with a sliding window).
And there is also the following:
http://searchlucene.com/jd/mahout/math/org/apache/mahout/math/stats/OnlineSummarizer.html
However, AFAICT it does not support arbitrary percentiles, only quartiles.
