commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ben Nguyen <bennguye...@gmail.com>
Subject RE: [GSoC][Regression][OLS] RegressionResults/OLSResults Re-design
Date Wed, 24 Jul 2019 16:00:28 GMT
Hello,
Thank you for your feedback, I will answer your questions:

> Why is the constructor "protected"?

I had thought it should be restricted to only work with ols.regress() but then it should’ve
be default.
But public might be the way to go, that restriction is probably not necessary….?

> Why is the computation performed in that constructor although the data is
> stored in "OLSRegression"? [This looks like what PMD complains about.]

Sorry, I don’t fully understand this question…. 
>From what I understand after searching about the Data Class violation:
Quote: “[OLSResults ]interface reveals more data than behaviour. Low values (less than 30%)
[OLSResults showed 20%] indicate that the class reveals much more data than behaviour, which
is a sign of poor encapsulation.”
Quote under “Weight Of Class (WOC)”
https://pmd.github.io/latest/pmd_java_metrics_index.html


> It's good to try and have encapsulation preserved through accessors but in
> that case it's broken because some return a reference to an array (entries
> are mutable).

I will fix that with a deepCopy() method, though I’m not sure where to put it, in same class?
Could also do one-liner like this for 2D array:
    public double[][] getBetaVariance() {
        return Arrays.stream(betaVariance).map(el -> el.clone()).toArray($ -> hatMatrix.clone());
    }
But I think .clone() is okay for primitive type 1D arrays? If not, System.arraycopy() should
be okay?

> Implementing an interface "RegressionResults" suggest that application
> should program against it, and not the specific "OLSResult"; however the
> latter has more methods than the interface defines.

Yeah, I think this is a conflict with:

> RegressionResults res = reg.regress(data);

Since not all regression types calculate the same statistics, ex: OLS has R-squared, TSS,
RSS
While GLS doesn’t officially have that (unless you count pseudo-solutions out there)
But of course, all regression types have many similar statistics like beta, beta variance,
residuals, etc
So I’m not sure how to structure an interface like RegressionResults to account for this,
I will look carefully into the factory method design pattern and understand it properly to
solve this, 
I think I might’ve missed things that would address this….? I will now definitely refactor
the current PR 23 into 
a proper factory method design as well with consideration for inner-interfaces you mentioned.

> What is the most useful functionality?
> What is the most robust functionality?
> IOW:  Is there a case for not computing everything?

My consideration (though I don’t have professional experience in Data science/engineering)
is maybe someone needs 
to calculate one statistic from many new samples (constantly coming in) at once…. This case
might not exist, but it’s what I thought of,
and considering for cases I haven’t thought of as well. The old code allowed for this so
I didn’t want to take that away….

> There are many redundant computations...

Yes, PR 23 right now is kind of v1 to show everything works. I was focused on getting the
port to work 100% with EJML first with old code base, 
I will look at optimization between methods (and factory methods) more deeply now (another
optimization found with decomposing QR twice as well). 

> > The RegressionResults interface could also include some kind of output method to
get all statistics in a file or displayed?
> Why should it be done within the class if it provides accessors to all
> its contents?

This issue asked for an ANOVA table printout: https://issues.apache.org/jira/browse/MATH-1097
So I was considering that (even though calculating F-stat and p-value would be another challenge),
But this lead me to consider a summary statistics printout like one found in R, for user convenience…
Are you saying this is not necessary for this application?

Again, thank you for your feedback,
-Ben Nguyen



From: Gilles Sadowski
Sent: Tuesday, July 23, 2019 6:45 PM
To: Commons Developers List
Subject: Re: [GSoC][Regression][OLS] RegressionResults/OLSResults Re-design

Hello.

[Please note that the first tag (in square bracket) on the "Subject" line should
be the (short) name of the component, i.e. in this case, "[Statistics]".]

Le mar. 23 juil. 2019 à 20:31, Ben Nguyen <bennguyenib@gmail.com> a écrit :
>
> Hello,
>
> Issue was noticed as a PMD violation during PR’s Travis CI build.
> “The class 'OLSResults' is suspected to be a Data Class (WOC=20.000%, NOPA=0, NOAM=4,
WMC=6)”
> https://github.com/apache/commons-statistics/pull/23
>
> UML for current design:
> https://github.com/BBenNguyenn/commons-statistics/blob/gsoc-milestone-1/commons-statistics-regression/UML_current.png
>
> The current design is my (probably wrong) interpretation on a suggestion from Eric Barnhill.
> The idea is that all calculation functionality is done in the OLSRegression class, which
passes itself to the OLSResults constructor via regress() method.

Better directly show the relevant code:

public class OLSRegression extends AbstractRegression {
    // ...

    public RegressionResults regress() {
        return new OLSResults(this);
    }
}

public class OLSResults implements RegressionResults, Serializable {
    protected OLSResults(OLSRegression ols) {
        this.beta = ols.estimateBeta();
        this.betaVariance = ols.estimateBetaVariance();
        this.betaStandardErrors = ols.estimateBetaStandardErrors();
        this.residuals = ols.estimateResiduals();
        this.regressionStandardErrors = ols.estimateRegressionStandardError();
        this.regressandVariance = ols.estimateRegressandVariance();

        this.hatMatrix = ols.calculateHat().toArray2D();
        this.totalSumOfSquares = ols.calculateTotalSumOfSquares();
        this.residualSumOfSquares = ols.calculateResidualSumOfSquares();
        this.regRSquared = ols.calculateRSquared();
        this.adjustedRSquared = ols.calculateAdjustedRSquared();
    }
}

Why is the constructor "protected"?
Why is the computation performed in that constructor although the data is
stored in "OLSRegression"? [This looks like what PMD complains about.]

> The OLSResults (implementing RegressionResults interface) calls on statistics calculating
methods from OLSRegression and saves it as private fields with getters.

It's good to try and have encapsulation preserved through accessors but in
that case it's broken because some return a reference to an array (entries
are mutable).

Implementing an interface "RegressionResults" suggest that application
should program against it, and not the specific "OLSResult"; however the
latter has more methods than the interface defines.

> I am thinking this is good for users who want everything to be calculated only once to
then be used multiple times (without the recalculation or having to save all statistics themselves).

What is the most useful functionality?
What is the most robust functionality?
IOW:  Is there a case for not computing everything?

These are two of the methods called in the constructor above:

    public double calculateRSquared() {
        return 1 - calculateResidualSumOfSquares() /
calculateTotalSumOfSquares();
    }

    public double calculateAdjustedRSquared() {
        final double n = getX().numRows();
        if (getHasIntercept()) {
            return 1 - (1 - calculateRSquared()) * (n / (n - getX().numCols()));
        } else {
            return 1 -
                   (calculateResidualSumOfSquares() * (n - 1)) /
                       (calculateTotalSumOfSquares() * (n - getX().numCols()));
        }
    }

There are many redundant computations...

> The RegressionResults interface could also include some kind of output method to get
all statistics in a file or displayed?

Why should it be done within the class if it provides accessors to all
its contents?

> Then, for those who only want a single calculated statistic, the methods to calculate
in OLSRegression are also public.

Then even assuming that you fix somehow "OLSResults" to compute
each result only once, a user who directly calls the above two methods
will get subpar performance.

> This design does cause that PMD error, so any suggestions or alternate designs to correct
my interpretation?

There are definite problems with your implementation that need fixing;
but it's not necessarily proof of a bad design.
However it's clearly not the most obvious one, which IMO would go along
the lines

RegressionData data = RegressionDataLoader.from(x, y, hasIntercept);
Regression reg = new OLSRegression();
RegressionResults res = reg.regress(data);

This is all untested of course, and could well fall short for advanced
usage (stream?).  Your design (or yet another) could be an attempt
to handle particular use-cases which the obvious approach does not
cover (?) ...

Regards,
Gilles

>
> Thank you,
> Cheers,
> -Ben Nguyen
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org



Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message