commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From venkatesha murthy <venkateshamurth...@gmail.com>
Subject Re: [math] use case for NaNStrategy.FIXED?
Date Sun, 29 Jun 2014 18:01:03 GMT
On Sun, Jun 29, 2014 at 10:55 PM, Phil Steitz <phil.steitz@gmail.com> wrote:

> On 6/29/14, 9:48 AM, venkatesha murthy wrote:
> > On Sun, Jun 29, 2014 at 1:25 AM, Phil Steitz <phil.steitz@gmail.com>
> wrote:
> >
> >> On 6/25/14, 12:12 AM, Luc Maisonobe wrote:
> >>> Le 25/06/2014 06:05, venkatesha murthy a écrit :
> >>>> On Wed, Jun 25, 2014 at 12:21 AM, Luc Maisonobe <luc@spaceroots.org>
> >> wrote:
> >>>>> Hi Venkat,
> >>>>>
> >>>>> Le 23/06/2014 21:08, venkatesha murthy a écrit :
> >>>>>> On Tue, Jun 24, 2014 at 12:08 AM, Luc Maisonobe <luc@spaceroots.org
> >
> >>>>> wrote:
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> While looking further in Percentile class for MATH-1120,
I have
> found
> >>>>>>> another problem in the current implementation. NaNStrategy.FIXED
> >> should
> >>>>>>> leave the NaNs in place, but at the end of the KthSelector.select
> >>>>>>> method, a call to Arrays.sort moves the NaNs to the end
of the
> small
> >>>>>>> sub-array. What is really implemented here is therefore
closer to
> >>>>>>> NaNStrategy.MAXIMAL than NaNStrategy.FIXED. This always
occur in
> the
> >>>>>>> test cases because they use very short arrays, and we directly
> >> switch to
> >>>>>>> this part of the select method.
> >>>>>> Are NaNs considered higher than +Inf ?
> >>>>>> If MAXIMAL represent replacing for +inf ; you need something
to
> >>>>>> indicate beyond this for NaN.
> >>>>> Well, we can just keep the NaN themselves and handled them
> >>>>> appropriately, hoping not to trash performances too much.
> >>>>>
> >>>> Agreed.
> >>>>
> >>>>>> What is the test input you see an issue and what is the actual
error
> >>>>>> you are seeing. Please share the test case.
> >>>>> Just look at PercentileTest.testReplaceNanInRange(). The first check
> in
> >>>>> the test corresponds to a Percentile configuration at 50% percentile,
> >>>>> and NaNStrategy.FIXED. The array has an odd number of entries, so
the
> >>>>> 50% percentile is exactly one of the entries: the one at index 5
in
> the
> >>>>> final array.
> >>>>>
> >>>>> The initial ordering is { 0, 1, 2, 3, 4, NaN, NaN, 5, 7, NaN, 8
}. So
> >>>>> for the NaNStrategy.FIXED setting, it should not be modified at
all
> in
> >>>>> the selection algorithm and the result for 50% should be the first
> NaN
> >>>>> of the array, at index 5. In fact, due to the Arrays.sort, we *do*
> >>>>> reorder the array into  { 0, 1, 2, 3, 4, 5, 7, 8, NaN, NaN, NaN
}, so
> >>>>> the result is 5.
> >>>>>
> >>>>> Agreed. just verified by putting back the earlier insertionSort
> >> function.
> >>>>> If we use NaNStrategy.MAXIMAL and any quantile above 67%, we get
as a
> >>>>> result Double.POSITIVE_INFINITY instead of Double.NaN.
> >>>>>
> >>>>> If we agree to leave FIXED as unchanged behaviour with your
> >> insertionSort
> >>>> code; then atleast MAXIMAL/MINIMAL should be allowed for
> transformation
> >> of
> >>>> NaN to +/-Inf
> >>> I'm fine with it, as long as we clearly state it in the NaNStrategy
> >>> documentation, saying somethig like:
> >>>
> >>>  When MAXIMAL/MINIMAL are used, the NaNs are effecively *replaced* by
> >>>  +/- infinity, hence the results will be computed as if infinities were
> >>>  there in the first place.
> >> -1 - this is mixing concerns.  NaNStrategy exists for one specific
> >> purpose - to turn extend the ordering on doubles a total ordering.
> >> Strategies prescribe only how NaNs are to be treated in the
> >> ordering.  Side effects on the input array don't make sense in this
> >> context.  The use case for which this class was created was rank
> >> transformations.  Returning infinities in rank transformations would
> >> blow up computations in these classes.  If a floating point
> >> computations need to transform NaNs, the specific stats / other
> >> classes that do this transformation should perform and document the
> >> behavior.
> >>
> >> Phil
> >>
> > OK. Agreed realized it later.  Hence i have not touched NaNStrategy in my
> > patch(MATH-1132) . Now i have added a separate transformer to transform
> NaNs
> > but it uses NanStrategy.  Please let me know on this as this
> trasnformation
> > itself
> > can be used in different places.
>
> Honestly, I don't see the value of this.  I would be more favorable
> to an addition to MathArrays supporting NaN (or infinity) filtering
> / transformation.  Several years back we made the decision to just
> let NaNs propagate in computations, which basically means that if
> you feed NaNs to [math] you are going to get NaNs out in results.
> If we do get into the business of filtering NaNs (or infinities for
> that matter), I think it is probably best to do that in MathArrays
> or somesuch - i.e., don't decorate APIs with NaN or
> infinity-handling handling descriptions.  That would be a huge task
> and would likely end up bleeding implementation detail into the
> APIs.  As I said above, NaNStrategy has to exist for rank
> transformations to be well-defined.  NaN-handling in floating point
> computations is defined and as long as we fully document what our
> algorithms do, I think it is fair enough to "let the NaNs fall where
> they may" - which basically means users need to do the filtering /
> transformation themselves.
>
> Phil
>
OK. i see the point.

> >
> >>>>>>> When I implemented the method, I explicitly avoided calling
> >> Arrays.sort
> >>>>>>> because it is a general purpose method that is know to be
efficient
> >> only
> >>>>>>> for arrays of at least medium size. In most cases, when
the array
> is
> >>>>>>> small one falls back to a non-recursive method like a very
simple
> >>>>>>> insertion sort, which is faster for smaller arrays.
> >>>>>> Please help me understand here; even java primitive Arrays.sort
does
> >>>>>> an insertion sort for less than 7 elements
> >>>>>> (Refer sort1(double x[], int off, int len))
> >>>>>> So what is it that the custom insertion sort that does differently
> or
> >>>>>> is advantageous. Is it the value 15 elements?
> >>>>> I don't see a reference to 7 elements, neither in the Java6 nor
in
> the
> >>>>> Java 7 doc
> >>>> Please take a look at the sort1 method where there is a first block
in
> >> the
> >>>> code which clearly mentions len < 7
> >>>>     /**
> >>>>      * Sorts the specified sub-array of doubles into ascending order.
> >>>>      */
> >>>>     private static void sort1(double x[], int off, int len) {
> >>>>     // Insertion sort on smallest arrays
> >>>>     if (len < 7) {
> >>>>         for (int i=off; i<len+off; i++)
> >>>>         for (int j=i; j>off && x[j-1]>x[j]; j--)
> >>>>             swap(x, j, j-1);
> >>>>         return;
> >>>>     }
> >>>>     :
> >>>>     :
> >>>>     : code continues for the else part
> >>>>
> >>>> Also the grepcode url
> >>>> <
> >>
> http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/Arrays.java#Arrays.sort1%28double[]%2Cint%2Cint%29
> >>>> indicates the same
> >>> OK, you were refering to a specific implementation. I was thinking in
> >>> the general case.
> >>>
> >>>> (and in any case the doc explicitly states the algorithms
> >>>>> explained are only implementation notes and are not part of the
> >>>>> specification).
> >>>>>
> >>>> Yes its a part of comments anyways.
> >>>>
> >>>>> However, the most important part for now is the fact that we control
> it
> >>>>> and may be able to implement different NaN strategies. What we have
> >>>>> currently fails.
> >>>>>
> >>>>> I agree on this  and hence here is my take:
> >>>> Leave FIXED as-is and use the earlier insertionSort code (just change
> >> the
> >>>> name to sort rather than hardcoding it as insertionsort) to handle the
> >> case
> >>>> you were mentioning
> >>> If we leave FIXED it as it is done know, even with insertionSort we do
> >>> not really control what happens. It is deterministic as the pivoting
> and
> >>> if/else structure is well defined (both in the selection part and in
> the
> >>> final sort for sub-arrays), but it is untrackable so we can't document
> >> it.
> >>>> Continue to use MAXIMAL/MINIMAL for +/-Inf transformation and that way
> >> we
> >>>> have covered both Inf and nan cases.
> >>> OK.
> >>>
> >>>> Use REMOVED as default for all Percentile Estimation Types. (mostly
> >>>> influenced by R here perhaps)
> >>> This would be fine with me. I have concerns with FIXED only, the other
> >>> strategies all seem quite realistic.
> >>>
> >>> Does anybody else has an advice for FIXED? What are the use case?
> >>>
> >>> best regards,
> >>> Luc
> >>>
> >>>> best regards,
> >>>>> Luc
> >>>>>
> >>>>>>> In the select
> >>>>>>> operation, we know we have small sub-arrays at the call
point.
> Going
> >>>>>>> back to the former insertionSort would recover the good
behavior
> for
> >>>>>>> small arrays, but would in fact not be sufficient to really
> >> implement a
> >>>>>>> NaNStrategy.FIXED. I guess it would be simple to make it
behave
> like
> >>>>>>> NaNStrategy.MAXIMAL but I did not try yet.
> >>>>>>>
> >>>>>>> My point here is rather: can we really and should we really
> implement
> >>>>>>> NaNStrategy.FIXED? Looking at how it is used elsewhere,
it needs to
> >>>>>>> store the original position of the NaNs. It is quite cumbersome.
> >>>>>>>
> >>>>>>> I wonder what is the use case for NaNStrategy.FIXED at all.
> >>>>>>>
> >>>>>>> Going further and looking at the use cases for other NaNStrategy
> >> values,
> >>>>>>> the NaNs are replaced by +/- infinity before sorting, which
is OK
> for
> >>>>>>> ranking as we only rely on the indices, but we use the values
> >> themselves
> >>>>>>> in Percentile. So sometimes, we end up with computing
> interpolations
> >>>>>>> like 0.5 * (x[k] + x[k+1]) or similar. If x[k] is a finite
number
> and
> >>>>>>> x[k+1] has been changed to +infinity, we get +infinity,
instead of
> >> the
> >>>>>>> NaN we should have retrieved without replacement. So here
again,
> I'm
> >> not
> >>>>>>> sure we are doing the right thing.
> >>>>>>>
> >>>>>>> What do you think?
> >>>>>>>
> >>>>>>> best regards,
> >>>>>>> Luc
> >>>>>>>
> >>>>>>>
> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> >>>>>>> For additional commands, e-mail: dev-help@commons.apache.org
> >>>>>>>
> >>>>>>
> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> >>>>>> For additional commands, e-mail: dev-help@commons.apache.org
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> >>>>> For additional commands, e-mail: dev-help@commons.apache.org
> >>>>>
> >>>>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> >>> For additional commands, e-mail: dev-help@commons.apache.org
> >>>
> >>>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> >> For additional commands, e-mail: dev-help@commons.apache.org
> >>
> >>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message