commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From venkatesha m <>
Subject Re: [Math] MATH-1129 and Re: [MATH-1120] Needed opinion about support on variations in, percentile calculation
Date Wed, 18 Jun 2014 18:28:01 GMT
Hi Luc, Gilles

First of, Iam immensly thankful to all your comments on this patch. Next, i am attaching my
new patch with today's date(18-jun). However please advise if i need to remove the old patch
file if it confuses.

Please find my response below. The new patch has the suggested changes in the switch case
for nan handling; But; However i have my view points on the different default nan strategies
associated to types. Please permit me to explain (sorry for long summary)

i would like to leave the Default implementation of Percentile as-is (Meaning in my MATH-1120
patch it is mapped to Type.CM)since otherwise we will break user old expectation even for
non nan and non inf entries as well. The existing Percentile implementation also did a copy
of the array slice before doing kth slection(the old evaluate method can be refered) and even
i am doing the same it slightly in different circumstance in the switch case with nan filtering.
The existing tests could fail if we change the default types to say R_8 (please refer to
code as well to see the finer variations in expected values that is being looked at for different
types. I have used R execution as the basis for setting those values).

Secondly, header comment states somewhere to an effect that NaNs would be (left as-is
and) handled by java's default sort behavior and no removal being done. So for me to map this
behaviour to new implementation; it was NaNStrategy.FIXED that came close and didnt require
any of the existing test cases for the existing Percentile behavior to change. What i am re-iterating
here is the existing behavior tests have completely passed with new Type.CM and FIXED. (And
now i have added several more tests including different types as well).

While all the R_x (where x :[1-9]) types as run and verified by R tool; seemed to clearly
convey the NaNs needed to be removed and hence you see that i have used different strategy
I agree while multiple defaults are not wise to have ; however; if we are forced to have Apache
CM as supported type (which is not one of R_x types) and we have the need to support multiple
variants (R1- R9) ; then it is inevitable to have type sepcific NaNStrategy as per the need.
I also feel ; NaN handling should be allowed for overriding atleast in a controlled manner
as different use cases may exist for needing this variation in nan handling. Therefore IMO
while we could avoid the public access to change these defaults; it is relevant to support
these variations of nan handling on a per type and allow atleast sub classes to override if
a rare need arises.
While the very name NaNStrategy reminds me of different ways to look at that; i feel we will
be much restricted if we just said that we stick to one way of NaNHandling for all types.
Please let me know your thoughts.

Regarding the PivotingStrategy; At first, i wanted to convey here that to have all the partitioning,
pivoting and selection in separated classes/enums than inside main class. I have made it as
static due to the fact that; it is more of a non-functional requirement and felt that it need
not be set for every instance (more of a global setting that doesnt vary across types).
Please correct me here and let know if it still needs to be per instance.
I also made it package accessable/settable solely because medianOf3 method had been package
level for the sole intent of possible overriding of the same within that package. Meaning;
if some one really needed to tinker around pivoting they need a way to do it which i have
provided it using a strategy.

In the current patch that i am going to attach as new dated patch (since you have already
started looking at the old one ; which i would leave it as is). I find many utility type methods;
replaceNaN, removeNaN( Predicated Lists ) and copyOf(values, begin.length) and as well as
KthSelector with PivotingStrategy etc all of which can perhaps make its way to MathArrays
and MathUtils. Please let me know.If so i will once again re-factor these changes and submit
the patch.

Thanks for reading this through and for your time in reviewing . Please let me know your opinion
on all of these.


On Wed, 18/6/14, Gilles <> wrote:

 Subject: Re: [Math] MATH-1129 and Re: [MATH-1120] Needed opinion about support on variations
in, percentile calculation
 Date: Wednesday, 18 June, 2014, 8:37 PM
 On Wed, 18 Jun 2014 16:39:12 +0200,
 Luc Maisonobe wrote:
 > Hi Gilles and Venkat,
 > Le 18/06/2014 15:40, Gilles a écrit :
 >> On Wed, 18 Jun 2014 15:02:41 +0200, Luc Maisonobe
 >>> Le 18/06/2014 14:32, Gilles a écrit :
 >>>> Hello Luc.
 >>>>>> See
 >>>>>> The problem reported was due to the
 sorting procedure not behaving
 >>>>>> correctly in the presence of NaN.
 >>>>>> This procedure could be replaced by
 an equivalent method from the JDK:
 >>>>>>   java.util.Arrays.sort(double[],int,int)
 >>>>>> Any objection?
 >>>>> If you imply removing the select,
 medianOf3 and partition methods,
 >>>>> yes I
 >>>>> would object. If you imply replacing
 only the insertionSort method used
 >>>>> for small sub-arrays, then I almost
 >>>> Issue 1129 concerns the latter: See the
 comment in "insertionSort" in
 >>>> the
 >>>> current code.
 >>>> However, for the former, you should really
 have a look at MATH-1120
 >>>> The proposed patch indeed touches those
 >>> So I think it is worth fixing MATH-1120 first,
 with the NaNStrategy and
 >>> go back to MATH-1129 afterwards, to see if it
 still applies of if
 >>> MATH-1120 also fixed it.
 >> As per my last comment on MATH-1120, the
 "NaNStrategy" part of the patch
 >> is an extension of the initial feature request.
 > Sure, but it is a really good addition and seems fair
 to consider.
 >> As per the Javadoc, the CM code was originally
 meant to handle (in some
 >> way), the presence of NaN values within the data.
 So I thought that this
 >> should be fixed before extending the API (which
 entailed additional
 >> questions as the proposed patch is fairly
 > Yes, the patch is a big one, and it has been works on
 for a while.
 > As it has quite stabilized by now, I think it would be
 the good time
 > to include it (with some editing I could do) and to
 start working with
 > separate patches applied on top of this one.
 > If we let it out of the code longer, it will become
 more and more
 > difficult for the contributor to work on it and to us
 to include it.
 >> Especially, the new code is clearly untested for
 performance, and this
 >> seems to be an important argument by your previous
 > Yes, it is important as some issues were raised on it
 (if you remember
 > well, they were raised by one researcher from another
 lab working on the
 > same big project as you, just after the symposium when
 we met).
 >> So, it is possible to add "isNaN" conditional, as I
 did in "insertionSort"
 >> and fix the current code without additional
 > I really consider it as an ugly hack rather than
 attempting to really
 > fix the mess I did in this class.
 >> Or, we redesign (as per MATH-1120) which implies
 forgetting about
 >> performance for now. The patch removes
 "insertionSort" altogether!
 > Well, it is just one line calling
 >   Arrays.sort(work, begin, end);
 > I can put the insertion sort back here while editing
 the patch before
 > committing it.
 >> IMHO, it is too many changes at once...
 > You know the popular saying about « premature
 optimization ». I think it
 > would be fair to temporarily forget about performance,
 fix the issue and
 > set up a new ground to work on based on MATH-1120
 patch, then improve
 > this by small corrections once the first huge patch has
 been committed,
 > and then look back at performances.
 That's fine with me. I was really waiting from someone else
 to apply
 this patch. ;-)
 >> What is the meaning of percentile when NaNs are
 kept in the data
 >> (strategy "FIXED")?
 >> For all the other strategies, the NaNs will not
 cause a problem
 >> within the algorithms being discussed here (being
 replaced by +inf
 >> or -inf, or removed, or raising an exception).
 > This is one point we should really discuss, but it can
 be done later.
 > I agree with you, FIXED is not appropriate here. We
 could simply raise
 > an exception if it is selected as not being meaningful
 for this algorithm.
 >>> I will have a look at the proposed patch,
 including your comments.
 >> Thanks.
 > So I did have a look and really think it is worth
 applying it. I agree
 > with your comments and will include your change for the
 NaNs recoding.
 > This change alone would solve MATH-1129 by the way.
 > I also am quite puzzled with the various default
 strategies and would
 > propose only one. I would even suggest to select
 immediately R8 instead
 > of our legacy method, as it would nevertheless be an
 improvement and
 > would not break existing code (only provide better
 > As there are several customization area (method,
 pivoting NaN strategy),
 > it would also be interesting to use the builder
 approach we started in
 > optimization. This would also allow us to pass an
 argument for the
 > RANDOM pivoting where the user could select a specific
 random generator
 > with specific seed, I really do not like having a
 random generator
 > blindly created with the automatic time dependent
 > Another point I already changed is the medianOf3 method
 throwing an
 > exception in addition to being deprecated. It is really
 not friendly to
 > users. So I replaced the exception by a call to the
 appropriate enum
 > method (but of course let the deprecation in place).
 > I also changed the pivoting strategy to be an instance
 field rather than
 > a class field, just as all the other customization
 > best regards,
 > Luc
 >> Gilles
 >>>> [...]
 To unsubscribe, e-mail:
 For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message