commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gilles <>
Subject Re: [Math] MATH-1129 and Re: [MATH-1120] Needed opinion about support on variations in, percentile calculation
Date Wed, 18 Jun 2014 15:07:37 GMT
On Wed, 18 Jun 2014 16:39:12 +0200, Luc Maisonobe wrote:
> Hi Gilles and Venkat,
> Le 18/06/2014 15:40, Gilles a écrit :
>> On Wed, 18 Jun 2014 15:02:41 +0200, Luc Maisonobe wrote:
>>> Le 18/06/2014 14:32, Gilles a écrit :
>>>> Hello Luc.
>>>>>> See
>>>>>> The problem reported was due to the sorting procedure not 
>>>>>> behaving
>>>>>> correctly in the presence of NaN.
>>>>>> This procedure could be replaced by an equivalent method from 
>>>>>> the JDK:
>>>>>>   java.util.Arrays.sort(double[],int,int)
>>>>>> Any objection?
>>>>> If you imply removing the select, medianOf3 and partition 
>>>>> methods,
>>>>> yes I
>>>>> would object. If you imply replacing only the insertionSort 
>>>>> method used
>>>>> for small sub-arrays, then I almost agree.
>>>> Issue 1129 concerns the latter: See the comment in "insertionSort" 
>>>> in
>>>> the
>>>> current code.
>>>> However, for the former, you should really have a look at 
>>>> MATH-1120
>>>> The proposed patch indeed touches those methods.
>>> So I think it is worth fixing MATH-1120 first, with the NaNStrategy 
>>> and
>>> go back to MATH-1129 afterwards, to see if it still applies of if
>>> MATH-1120 also fixed it.
>> As per my last comment on MATH-1120, the "NaNStrategy" part of the 
>> patch
>> is an extension of the initial feature request.
> Sure, but it is a really good addition and seems fair to consider.
>> As per the Javadoc, the CM code was originally meant to handle (in 
>> some
>> way), the presence of NaN values within the data. So I thought that 
>> this
>> should be fixed before extending the API (which entailed additional
>> questions as the proposed patch is fairly "massive").
> Yes, the patch is a big one, and it has been works on for a while.
> As it has quite stabilized by now, I think it would be the good time
> to include it (with some editing I could do) and to start working 
> with
> separate patches applied on top of this one.
> If we let it out of the code longer, it will become more and more
> difficult for the contributor to work on it and to us to include it.
>> Especially, the new code is clearly untested for performance, and 
>> this
>> seems to be an important argument by your previous post.
> Yes, it is important as some issues were raised on it (if you 
> remember
> well, they were raised by one researcher from another lab working on 
> the
> same big project as you, just after the symposium when we met).
>> So, it is possible to add "isNaN" conditional, as I did in 
>> "insertionSort"
>> and fix the current code without additional impact.
> I really consider it as an ugly hack rather than attempting to really
> fix the mess I did in this class.
>> Or, we redesign (as per MATH-1120) which implies forgetting about
>> performance for now. The patch removes "insertionSort" altogether!
> Well, it is just one line calling
>   Arrays.sort(work, begin, end);
> I can put the insertion sort back here while editing the patch before
> committing it.
>> IMHO, it is too many changes at once...
> You know the popular saying about « premature optimization ». I think 
> it
> would be fair to temporarily forget about performance, fix the issue 
> and
> set up a new ground to work on based on MATH-1120 patch, then improve
> this by small corrections once the first huge patch has been 
> committed,
> and then look back at performances.

That's fine with me. I was really waiting from someone else to apply
this patch. ;-)


>> What is the meaning of percentile when NaNs are kept in the data
>> (strategy "FIXED")?
>> For all the other strategies, the NaNs will not cause a problem
>> within the algorithms being discussed here (being replaced by +inf
>> or -inf, or removed, or raising an exception).
> This is one point we should really discuss, but it can be done later.
> I agree with you, FIXED is not appropriate here. We could simply 
> raise
> an exception if it is selected as not being meaningful for this 
> algorithm.
>>> I will have a look at the proposed patch, including your comments.
>> Thanks.
> So I did have a look and really think it is worth applying it. I 
> agree
> with your comments and will include your change for the NaNs 
> recoding.
> This change alone would solve MATH-1129 by the way.
> I also am quite puzzled with the various default strategies and would
> propose only one. I would even suggest to select immediately R8 
> instead
> of our legacy method, as it would nevertheless be an improvement and
> would not break existing code (only provide better results).
> As there are several customization area (method, pivoting NaN 
> strategy),
> it would also be interesting to use the builder approach we started 
> in
> optimization. This would also allow us to pass an argument for the
> RANDOM pivoting where the user could select a specific random 
> generator
> with specific seed, I really do not like having a random generator
> blindly created with the automatic time dependent seed.
> Another point I already changed is the medianOf3 method throwing an
> exception in addition to being deprecated. It is really not friendly 
> to
> users. So I replaced the exception by a call to the appropriate enum
> method (but of course let the deprecation in place).
> I also changed the pivoting strategy to be an instance field rather 
> than
> a class field, just as all the other customization parameters.
> best regards,
> Luc
>> Gilles
>>>> [...]

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message