commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Phil Steitz <phil.ste...@gmail.com>
Subject Re: [math] use case for NaNStrategy.FIXED?
Date Sun, 29 Jun 2014 17:25:58 GMT
On 6/29/14, 9:48 AM, venkatesha murthy wrote:
> On Sun, Jun 29, 2014 at 1:25 AM, Phil Steitz <phil.steitz@gmail.com> wrote:
>
>> On 6/25/14, 12:12 AM, Luc Maisonobe wrote:
>>> Le 25/06/2014 06:05, venkatesha murthy a écrit :
>>>> On Wed, Jun 25, 2014 at 12:21 AM, Luc Maisonobe <luc@spaceroots.org>
>> wrote:
>>>>> Hi Venkat,
>>>>>
>>>>> Le 23/06/2014 21:08, venkatesha murthy a écrit :
>>>>>> On Tue, Jun 24, 2014 at 12:08 AM, Luc Maisonobe <luc@spaceroots.org>
>>>>> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> While looking further in Percentile class for MATH-1120, I have
found
>>>>>>> another problem in the current implementation. NaNStrategy.FIXED
>> should
>>>>>>> leave the NaNs in place, but at the end of the KthSelector.select
>>>>>>> method, a call to Arrays.sort moves the NaNs to the end of the
small
>>>>>>> sub-array. What is really implemented here is therefore closer
to
>>>>>>> NaNStrategy.MAXIMAL than NaNStrategy.FIXED. This always occur
in the
>>>>>>> test cases because they use very short arrays, and we directly
>> switch to
>>>>>>> this part of the select method.
>>>>>> Are NaNs considered higher than +Inf ?
>>>>>> If MAXIMAL represent replacing for +inf ; you need something to
>>>>>> indicate beyond this for NaN.
>>>>> Well, we can just keep the NaN themselves and handled them
>>>>> appropriately, hoping not to trash performances too much.
>>>>>
>>>> Agreed.
>>>>
>>>>>> What is the test input you see an issue and what is the actual error
>>>>>> you are seeing. Please share the test case.
>>>>> Just look at PercentileTest.testReplaceNanInRange(). The first check
in
>>>>> the test corresponds to a Percentile configuration at 50% percentile,
>>>>> and NaNStrategy.FIXED. The array has an odd number of entries, so the
>>>>> 50% percentile is exactly one of the entries: the one at index 5 in the
>>>>> final array.
>>>>>
>>>>> The initial ordering is { 0, 1, 2, 3, 4, NaN, NaN, 5, 7, NaN, 8 }. So
>>>>> for the NaNStrategy.FIXED setting, it should not be modified at all in
>>>>> the selection algorithm and the result for 50% should be the first NaN
>>>>> of the array, at index 5. In fact, due to the Arrays.sort, we *do*
>>>>> reorder the array into  { 0, 1, 2, 3, 4, 5, 7, 8, NaN, NaN, NaN }, so
>>>>> the result is 5.
>>>>>
>>>>> Agreed. just verified by putting back the earlier insertionSort
>> function.
>>>>> If we use NaNStrategy.MAXIMAL and any quantile above 67%, we get as a
>>>>> result Double.POSITIVE_INFINITY instead of Double.NaN.
>>>>>
>>>>> If we agree to leave FIXED as unchanged behaviour with your
>> insertionSort
>>>> code; then atleast MAXIMAL/MINIMAL should be allowed for transformation
>> of
>>>> NaN to +/-Inf
>>> I'm fine with it, as long as we clearly state it in the NaNStrategy
>>> documentation, saying somethig like:
>>>
>>>  When MAXIMAL/MINIMAL are used, the NaNs are effecively *replaced* by
>>>  +/- infinity, hence the results will be computed as if infinities were
>>>  there in the first place.
>> -1 - this is mixing concerns.  NaNStrategy exists for one specific
>> purpose - to turn extend the ordering on doubles a total ordering.
>> Strategies prescribe only how NaNs are to be treated in the
>> ordering.  Side effects on the input array don't make sense in this
>> context.  The use case for which this class was created was rank
>> transformations.  Returning infinities in rank transformations would
>> blow up computations in these classes.  If a floating point
>> computations need to transform NaNs, the specific stats / other
>> classes that do this transformation should perform and document the
>> behavior.
>>
>> Phil
>>
> OK. Agreed realized it later.  Hence i have not touched NaNStrategy in my
> patch(MATH-1132) . Now i have added a separate transformer to transform NaNs
> but it uses NanStrategy.  Please let me know on this as this trasnformation
> itself
> can be used in different places.

Honestly, I don't see the value of this.  I would be more favorable
to an addition to MathArrays supporting NaN (or infinity) filtering
/ transformation.  Several years back we made the decision to just
let NaNs propagate in computations, which basically means that if
you feed NaNs to [math] you are going to get NaNs out in results. 
If we do get into the business of filtering NaNs (or infinities for
that matter), I think it is probably best to do that in MathArrays
or somesuch - i.e., don't decorate APIs with NaN or
infinity-handling handling descriptions.  That would be a huge task
and would likely end up bleeding implementation detail into the
APIs.  As I said above, NaNStrategy has to exist for rank
transformations to be well-defined.  NaN-handling in floating point
computations is defined and as long as we fully document what our
algorithms do, I think it is fair enough to "let the NaNs fall where
they may" - which basically means users need to do the filtering /
transformation themselves.

Phil
>
>>>>>>> When I implemented the method, I explicitly avoided calling
>> Arrays.sort
>>>>>>> because it is a general purpose method that is know to be efficient
>> only
>>>>>>> for arrays of at least medium size. In most cases, when the array
is
>>>>>>> small one falls back to a non-recursive method like a very simple
>>>>>>> insertion sort, which is faster for smaller arrays.
>>>>>> Please help me understand here; even java primitive Arrays.sort does
>>>>>> an insertion sort for less than 7 elements
>>>>>> (Refer sort1(double x[], int off, int len))
>>>>>> So what is it that the custom insertion sort that does differently
or
>>>>>> is advantageous. Is it the value 15 elements?
>>>>> I don't see a reference to 7 elements, neither in the Java6 nor in the
>>>>> Java 7 doc
>>>> Please take a look at the sort1 method where there is a first block in
>> the
>>>> code which clearly mentions len < 7
>>>>     /**
>>>>      * Sorts the specified sub-array of doubles into ascending order.
>>>>      */
>>>>     private static void sort1(double x[], int off, int len) {
>>>>     // Insertion sort on smallest arrays
>>>>     if (len < 7) {
>>>>         for (int i=off; i<len+off; i++)
>>>>         for (int j=i; j>off && x[j-1]>x[j]; j--)
>>>>             swap(x, j, j-1);
>>>>         return;
>>>>     }
>>>>     :
>>>>     :
>>>>     : code continues for the else part
>>>>
>>>> Also the grepcode url
>>>> <
>> http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/Arrays.java#Arrays.sort1%28double[]%2Cint%2Cint%29
>>>> indicates the same
>>> OK, you were refering to a specific implementation. I was thinking in
>>> the general case.
>>>
>>>> (and in any case the doc explicitly states the algorithms
>>>>> explained are only implementation notes and are not part of the
>>>>> specification).
>>>>>
>>>> Yes its a part of comments anyways.
>>>>
>>>>> However, the most important part for now is the fact that we control
it
>>>>> and may be able to implement different NaN strategies. What we have
>>>>> currently fails.
>>>>>
>>>>> I agree on this  and hence here is my take:
>>>> Leave FIXED as-is and use the earlier insertionSort code (just change
>> the
>>>> name to sort rather than hardcoding it as insertionsort) to handle the
>> case
>>>> you were mentioning
>>> If we leave FIXED it as it is done know, even with insertionSort we do
>>> not really control what happens. It is deterministic as the pivoting and
>>> if/else structure is well defined (both in the selection part and in the
>>> final sort for sub-arrays), but it is untrackable so we can't document
>> it.
>>>> Continue to use MAXIMAL/MINIMAL for +/-Inf transformation and that way
>> we
>>>> have covered both Inf and nan cases.
>>> OK.
>>>
>>>> Use REMOVED as default for all Percentile Estimation Types. (mostly
>>>> influenced by R here perhaps)
>>> This would be fine with me. I have concerns with FIXED only, the other
>>> strategies all seem quite realistic.
>>>
>>> Does anybody else has an advice for FIXED? What are the use case?
>>>
>>> best regards,
>>> Luc
>>>
>>>> best regards,
>>>>> Luc
>>>>>
>>>>>>> In the select
>>>>>>> operation, we know we have small sub-arrays at the call point.
Going
>>>>>>> back to the former insertionSort would recover the good behavior
for
>>>>>>> small arrays, but would in fact not be sufficient to really
>> implement a
>>>>>>> NaNStrategy.FIXED. I guess it would be simple to make it behave
like
>>>>>>> NaNStrategy.MAXIMAL but I did not try yet.
>>>>>>>
>>>>>>> My point here is rather: can we really and should we really implement
>>>>>>> NaNStrategy.FIXED? Looking at how it is used elsewhere, it needs
to
>>>>>>> store the original position of the NaNs. It is quite cumbersome.
>>>>>>>
>>>>>>> I wonder what is the use case for NaNStrategy.FIXED at all.
>>>>>>>
>>>>>>> Going further and looking at the use cases for other NaNStrategy
>> values,
>>>>>>> the NaNs are replaced by +/- infinity before sorting, which is
OK for
>>>>>>> ranking as we only rely on the indices, but we use the values
>> themselves
>>>>>>> in Percentile. So sometimes, we end up with computing interpolations
>>>>>>> like 0.5 * (x[k] + x[k+1]) or similar. If x[k] is a finite number
and
>>>>>>> x[k+1] has been changed to +infinity, we get +infinity, instead
of
>> the
>>>>>>> NaN we should have retrieved without replacement. So here again,
I'm
>> not
>>>>>>> sure we are doing the right thing.
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> best regards,
>>>>>>> Luc
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>>>>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>>>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>>>>
>>>>>>
>>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>>>
>>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>> For additional commands, e-mail: dev-help@commons.apache.org
>>
>>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message