commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Luc Maisonobe <>
Subject Re: [Math] MATH-1129 and Re: [MATH-1120] Needed opinion about support on variations in, percentile calculation
Date Wed, 18 Jun 2014 14:39:12 GMT
Hi Gilles and Venkat,

Le 18/06/2014 15:40, Gilles a écrit :
> On Wed, 18 Jun 2014 15:02:41 +0200, Luc Maisonobe wrote:
>> Le 18/06/2014 14:32, Gilles a écrit :
>>> Hello Luc.
>>>>> See
>>>>> The problem reported was due to the sorting procedure not behaving
>>>>> correctly in the presence of NaN.
>>>>> This procedure could be replaced by an equivalent method from the JDK:
>>>>>   java.util.Arrays.sort(double[],int,int)
>>>>> Any objection?
>>>> If you imply removing the select, medianOf3 and partition methods,
>>>> yes I
>>>> would object. If you imply replacing only the insertionSort method used
>>>> for small sub-arrays, then I almost agree.
>>> Issue 1129 concerns the latter: See the comment in "insertionSort" in
>>> the
>>> current code.
>>> However, for the former, you should really have a look at MATH-1120
>>> The proposed patch indeed touches those methods.
>> So I think it is worth fixing MATH-1120 first, with the NaNStrategy and
>> go back to MATH-1129 afterwards, to see if it still applies of if
>> MATH-1120 also fixed it.
> As per my last comment on MATH-1120, the "NaNStrategy" part of the patch
> is an extension of the initial feature request.

Sure, but it is a really good addition and seems fair to consider.

> As per the Javadoc, the CM code was originally meant to handle (in some
> way), the presence of NaN values within the data. So I thought that this
> should be fixed before extending the API (which entailed additional
> questions as the proposed patch is fairly "massive").

Yes, the patch is a big one, and it has been works on for a while.
As it has quite stabilized by now, I think it would be the good time
to include it (with some editing I could do) and to start working with
separate patches applied on top of this one.

If we let it out of the code longer, it will become more and more
difficult for the contributor to work on it and to us to include it.

> Especially, the new code is clearly untested for performance, and this
> seems to be an important argument by your previous post.

Yes, it is important as some issues were raised on it (if you remember
well, they were raised by one researcher from another lab working on the
same big project as you, just after the symposium when we met).

> So, it is possible to add "isNaN" conditional, as I did in "insertionSort"
> and fix the current code without additional impact.

I really consider it as an ugly hack rather than attempting to really
fix the mess I did in this class.

> Or, we redesign (as per MATH-1120) which implies forgetting about
> performance for now. The patch removes "insertionSort" altogether!

Well, it is just one line calling
  Arrays.sort(work, begin, end);
I can put the insertion sort back here while editing the patch before
committing it.

> IMHO, it is too many changes at once...

You know the popular saying about « premature optimization ». I think it
would be fair to temporarily forget about performance, fix the issue and
set up a new ground to work on based on MATH-1120 patch, then improve
this by small corrections once the first huge patch has been committed,
and then look back at performances.

> What is the meaning of percentile when NaNs are kept in the data
> (strategy "FIXED")?
> For all the other strategies, the NaNs will not cause a problem
> within the algorithms being discussed here (being replaced by +inf
> or -inf, or removed, or raising an exception).

This is one point we should really discuss, but it can be done later.
I agree with you, FIXED is not appropriate here. We could simply raise
an exception if it is selected as not being meaningful for this algorithm.

>> I will have a look at the proposed patch, including your comments.
> Thanks.

So I did have a look and really think it is worth applying it. I agree
with your comments and will include your change for the NaNs recoding.
This change alone would solve MATH-1129 by the way.

I also am quite puzzled with the various default strategies and would
propose only one. I would even suggest to select immediately R8 instead
of our legacy method, as it would nevertheless be an improvement and
would not break existing code (only provide better results).

As there are several customization area (method, pivoting NaN strategy),
it would also be interesting to use the builder approach we started in
optimization. This would also allow us to pass an argument for the
RANDOM pivoting where the user could select a specific random generator
with specific seed, I really do not like having a random generator
blindly created with the automatic time dependent seed.

Another point I already changed is the medianOf3 method throwing an
exception in addition to being deprecated. It is really not friendly to
users. So I replaced the exception by a call to the appropriate enum
method (but of course let the deprecation in place).

I also changed the pivoting strategy to be an instance field rather than
a class field, just as all the other customization parameters.

best regards,

> Gilles
>>> [...]
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message