commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Herbert <alex.d.herb...@gmail.com>
Subject Re: [GSoC][Commons][Statistics][Descriptive] Should Standard Deviation extend Variance & Variance extend Moment class?
Date Mon, 01 Jul 2019 20:47:16 GMT
Re-sent to list.

> On 1 Jul 2019, at 18:49, Gilles Sadowski <gilleseran@gmail.com> wrote:
> 
> Le lun. 1 juil. 2019 à 19:10, Alex Herbert <alex.d.herbert@gmail.com <mailto:alex.d.herbert@gmail.com>>
a écrit :
>> 
>> On 01/07/2019 08:59, Gilles Sadowski wrote:
>>> Hello.
>>> 
>>> Le lun. 1 juil. 2019 à 00:50, Virendra singh Rajpurohit
>>> <virendrasinghrp@gmail.com> a écrit :
>>>> Hi everyone,
>>>> I've been working on descriptive module of "Commons-Statistics". My
>>>> question is should I extend
>>> I'll give hints as if the question mark was after the word "extend".
>>> 
>>> For starter, please read this:
>>>     https://books.google.fr/books?id=ka2VUBqHiWkC&pg=PA81&lpg=PA81&dq=java+inheritance+bloch&source=bl&ots=y_IgRku_OX&sig=ACfU3U3VMmrHT2ILGJQluunGLJcEOMICBA&hl=fr&sa=X&ved=2ahUKEwicsvv6nJPjAhVwx4UKHdU6BFUQ6AEwBXoECAkQAQ#v=onepage&q=java%20inheritance%20bloch&f=false
>>> (and the whole book is advice to follow).
>>> 
>>> Then, there might be some useful information here:
>>>     https://stackoverflow.com/questions/218744/good-reasons-to-prohibit-inheritance-in-java
>>> 
>>> Regards,
>>> Gilles
>> 
>> This issue arises because the algorithms for the mean and the variance
>> share computation and the standard deviation is derived from the variance.
>> 
>> They can be combined using inheritance. The reference above indicates
>> that this is poor practice unless you specifically design for it.
>> Splitting them all apart as separate classes will result in code
>> duplication. This is also considered poor practice without a good
>> reason. However there is one good reason in this case (as I will show
>> later).
> 
> Avoiding duplication is a corollary of inheritance, but not the condition
> for using it.
> There must an "is-a" relationship.
> 
>> The solution in Commons Math was to have a package-private set of moment
>> classes that have an inheritance hierarchy. The public facing classes
>> can then wrap these internal classes and avoid inheritance. However in
>> Commons math some of the public constructors depended on package private
>> moment classes, and one moment class was public. So the API was flawed.
>> This is the subject of an issue which discusses removing the moment
>> classes and avoiding inheritance [1]. The outcome of this discussion
>> (from 2016) is not in the Jira ticket.
>> 
>> There is also the question of whether the third and fourth central
>> moments should be ported from Commons Math with the associated Skewness
>> and Kurtosis statistics. It should be noted that at the limit of this
>> hierarchy the fourth central moment has the data to compute Mean,
>> Variance+Standard Deviation, Skewness and Kurtosis.
>> 
>> Although it is possible to compute many statistics from a moment I would
>> say it can be assumed if the user has selected a specific statistic
>> class then they are only interested in 1 statistic. Construction of a
>> flexible SummaryStatistics class to provide multiple statistics should
>> be part of the design and so objects to compute different moments and
>> the ability to use them to compute different statistics should be part
>> of the design. This can be addressed separately but should be kept in mind.
>> 
>> If inheritance is eliminated at the jump from 1st to 2nd moment it will
>> result in a small amount of code duplication.:
>> 
>> Moment1 = 4 lines
>> Moment2 = 1 extra line (5 total)
>> Moment3 = 4 extra lines (9 total)
>> Moment4 = 4 extra lines (13 total)
>> 
>> However all protected variables used in the hierarchy can be local to
>> the method and the code (although duplicated) may be easier to follow.
>> 
>> If the requirements of a moment are specified in an interface it can be
>> seen that the hierarchy is simple:
>> 
>> interface FirstMoment {
>>     long getN();
>>     double getM1();
>>     void accept(double d);
>>     void combine(FirstMoment m1);
>> }
>> 
>> interface SecondMoment extends FirstMoment {
>>     double getM2();
>>     void combine(SecondMoment m2);
>> }
>> 
>> interface ThirdMoment extends SecondMoment {
>>     double getM3();
>>     void combine(ThirdMoment m3);
>> }
> 
> "is-a" condition is not met.

Do you mean that ThirdMoment “is-a” SecondMoment is not true? While this is correct the
way the algorithm is implemented means that a ThirdMoment does have all the data for the lower
order moments.

> 
> Another direction to explore:
> 
> interface Moment<T> {
>    long getN();
>    double getValue();
>    void accept(double);
>    void combine(Moment<T> m);
> }

And <T> is an enum?

I do not see how this helps this:

ThirdMoment m3 = …
Variance var = new Variance(m3);

If ThirdMoment is to be used generically it needs a method to obtain the value for all moments
from 3 to 1. The single ‘getValue()’ is not enough. It would require:

interface Moment<T extends MomentOrder> {
   long getN();
   double getValue(M extends MomentOrder);
   void accept(double);
   void combine(Moment<T> m);
}

Where M is <= T in the enum ranking. So basically it is an integer and you can restrict
the combine method using self-types:

interface Moment<T extends Moment<T>> {
   long getN();
   double getValue(int order);
   void accept(double);
   void combine(T m);
}

With this you have a contract that a Moment can only be combined with instances of the same
class. But you do not have a restriction on what orders you will support. So for example the
Variance constructor is:

<T extends Moment<T>> Variance(Moment<T> m) {
    // Now stuck with a generic moment. Data can be extracted with getValue(int)
    // to create a concrete internal moment. But then the input moment is not used
    // by reference. To do so requires Variance is parameterised with T as well.
}

Note that all this is to support a SummaryStatistics type class that uses SecondMoment to
obtain a mean, standard deviation and variance. If this is the extent of the usage then support
for a generic hierarchy is not required.

So put all any moments as package private and they can be changed in the future. Basically
see how far the new API in numbers can be taken before a decision must be made on whether
a moment hierarchy is to be part of the API. At current for Mean, Variance and StdDev just
a single SecondMoment implementation is required.


> 
> Regards,
> Gilles
> 
>> Note that the combine method is not in Commons Math and allows the
>> classes to be used as aggregators, for example in the collect method of
>> the Streams API.
>> 
>> The issue that I see is that the requirement to be used as an aggregator
>> adds a new method to combine with another instance of the same class at
>> each level of the hierarchy. This can result for example in a
>> SecondMoment being updated with a FirstMoment and the result is invalid.
>> 
>> To prevent this requires exceptions to be thrown when lower moments are
>> combined with higher moments. That is not a friendly API. The only
>> absolute way I see to prevent this is to remove inheritance and take on
>> the burden of code duplication at each additional level of moment.
>> Alternatively exploiting the inheritance to remove code duplication will
>> require careful use. This could be done with package level classes.
>> 
>> 
>> Here is a suggestion of one way forward:
>> 
>> Create public Mean, Variance and StandardDeviation classes. These define
>> the public API and what you are to support.
>> 
>> Create a package private SecondMoment class. This class can be used by
>> both Variance and StandardDeviation to do the computation. Only the
>> final step to compute the output statistic is different.
>> 
>> There will be no inheritance between SecondMoment and Mean.
>> 
>> If ThirdMoment and FourthMoment are later added then this can be done
>> with code duplication. The use of duplication should be documented as a
>> deliberate avoidance of inheritance to prevent combination of moments of
>> different orders.
>> 
>> Note that a SecondMoment can be used as the workhorse for a
>> SummaryStatistics class. This is if the Commons Math pattern is used
>> where the Mean and Variance can accept a SecondMoment in a constructor
>> or have a static computation method which accepts the moment data.
>> 
>> TBD:
>> 
>> I do not see why the SecondMoment could not be public. If it is a
>> standalone class free from inheritance it may be of use to someone.
>> Here's an example if the SecondMoment exposes all the data using
>> accessors that can then be used:
>> 
>> double[] d = { 1.1, 2.2, 3.3 };
>> SecondMoment m2 = Arrays.stream(d).collect(SecondMoment::new,
>>                                            SecondMoment::accept,
>>                                            SecondMoment::combine);
>> double mean = m2.getM1();
>> StandardDeviation sd = new StandardDeviation(m2);
>> sd.setBiasCorrected(true);
>> double stdDev = sd.getStandardDeviation();
>> 
>> One issue with removing inheritance is how to write a SummaryStatistics
>> class that could for example use a ThirdMoment to compute Mean,
>> StandardDeviation and Skewness:
>> 
>> double[] d = { 1.1, 2.2, 3.3 };
>> ThirdMoment m3 = Arrays.stream(d).collect(ThirdMoment::new,
>>                                            ThirdMoment::accept,
>>                                            ThirdMoment::combine);
>> double mean = m3.getM1();
>> StandardDeviation sd = new StandardDeviation(m3); // INHERITANCE EXPLOITED
>> sd.setBiasCorrected(true);
>> double stdDev = sd.getStandardDeviation();
>> double skew = new Skewness(m3).getSkewness();
>> 
>> Without inheritance each could be done using static helper methods:
>> 
>> static double getVariance(long n, double m1, double m2);
>> static double getSkewness(long n, double m1, double m2, double m3);
>> 
>> Note: If the SummaryStatistics is only to support the same functionality
>> as CM SummaryStatistics then this is all possible using a SecondMoment
>> as described above.
>> 
>> So I seem to have created some more questions and shown part of a
>> possible solution. Without the API for the SummaryStatistics it may not
>> be possible to choose a solution. The question is how flexible does the
>> API need to be with regard to reusing the moment data?
>> 
>> Opinions on this?
>> 
>> 
>> [1] https://issues.apache.org/jira/browse/MATH-1228
>> 
>>> 
>>>> Variance class in StandardDeviaiton class?
>>>> Also, should Moment classes be developed and inherited in Variance? As of
>>>> now, I created only FirstMoment  & Variance class. If the answer for
>>>> inheritance is NO, should we rename FirstMoment to Mean?
>>>> Thanks
>>>> --
>>>> Warm Regards
>>>> *Virendra Singh Rajpurohit*
>>>> *University of Petroleum and Energy Studies,Dehradun*
>>>> Linkedin:https://www.linkedin.com/in/virendra-singh-rajpurohit
>>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>> For additional commands, e-mail: dev-help@commons.apache.org


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message