commons-dev mailing list archives

From Gilles Sadowski
Subject Re: [GSoC][STATISTICS][Regression] Architecture Implementation Suggestions
Date Thu, 16 May 2019 13:26:06 GMT

On Thu, 16 May 2019 at 10:02, Ben Nguyen wrote:
> Hello,
> I have some broad ideas about how the regression module should be structured,
> as outlined briefly in my proposal with UML diagrams.
> This is the current implementation inside commons-math-stat-regression:

It seems there is/was an image here but I don't see it.

For this kind of information, please use JIRA (and provide the link here).

> This is my proposed idea, where the structure was partly inspired by SuanShu since it
> supported multiple types of regression (including logistic):
> Disclaimer: I have only studied some econometrics and second year computer science in
> university, so I have zero professional data engineering experience, but am excited to start
> learning with this project. So, I don't currently know the exact needs of data engineers
> with regard to this module and am learning as I go, which is why I would very much appreciate
> any input on the kinds of requirements data engineers would want from this regression module.

Basing a design on use-cases is very useful.
You should collect a range of them (small/large datasets, in-memory/stream,
dense/sparse) in order to figure what parts of the code can be common and
what requires specialization.

> From someone who has used the current implementation or will use this new implementation:
> What would make your life easier?
> What should definitely be kept?
> What should be added/improved?
> Any specific features or design criteria?
> Any changes or radically different approaches to the following idea?

Good questions!
What are your answers? ;-)

> Note: OLS, GLS and Logistic regression are the first to be implemented, with a focus on
> providing architectural support for further additions. Changes will make use of new Java 8 features,
> specifically the Java Streams API to improve performance and readability.

I'd suggest selecting one and starting to code, without fearing that you'll
probably have to change a lot of it as more use-cases are collected.
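
As an illustration of how small such a first step can be, here is a minimal
sketch of OLS restricted to a single regressor, written with the Java 8
Streams API mentioned above. It is not code from the proposal or from
Commons Math; the class name "SimpleOlsSketch" and its "fit" method are
hypothetical, and the closed-form slope/intercept formulas stand in for
whatever matrix-based approach is eventually chosen.

```java
import java.util.stream.IntStream;

/** Hypothetical illustration only; not part of any Commons component. */
final class SimpleOlsSketch {
    private SimpleOlsSketch() {}

    /**
     * Fits y = intercept + slope * x by ordinary least squares.
     *
     * @return a two-element array: {intercept, slope}
     */
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("Need at least two paired observations");
        }
        final int n = x.length;
        final double meanX = IntStream.range(0, n).mapToDouble(i -> x[i]).average().getAsDouble();
        final double meanY = IntStream.range(0, n).mapToDouble(i -> y[i]).average().getAsDouble();

        // Sum of cross-deviations and sum of squared deviations of x.
        final double sxy = IntStream.range(0, n)
            .mapToDouble(i -> (x[i] - meanX) * (y[i] - meanY))
            .sum();
        final double sxx = IntStream.range(0, n)
            .mapToDouble(i -> (x[i] - meanX) * (x[i] - meanX))
            .sum();
        if (sxx == 0) {
            throw new IllegalArgumentException("All x values are identical");
        }

        final double slope = sxy / sxx;
        final double intercept = meanY - slope * meanX;
        return new double[] {intercept, slope};
    }
}
```

Calling "SimpleOlsSketch.fit(x, y)" on points lying exactly on a line recovers
that line's intercept and slope up to rounding, which makes it easy to pair
with a first unit test.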

> Updates to this proposed implementation UML in my proposal:
> “statistics-regression-reqLinearMath” will be replaced with EJML as suggested by
> Mr. Eric Barnhill
> This will include a custom matrix class extended from EJML’s SimpleBase -> StatisticsMatrix
> So if we decide to use an Apache Commons implementation of matrices later on, only this
> class should be changed internally.

Good precaution; but I doubt that we can include everything in a
single class.
How to best encapsulate the linear algebra (external) library is a
subject on its own, worth its own thread:  Cramming many questions
in a single post makes it likely that some will be missed by some
people who might later on question the chosen path.  [External
dependencies are a sensitive issue in Commons...]

Also, a reminder that we need to take into account the comparative
benchmarks which I posted recently.  [Even if just to conclude that
EJML has overwhelming advantages (which?) that make it more
suitable than its "competitors".]
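
To make the encapsulation question above more concrete, here is one possible
shape for it: a small interface that the regression code depends on, with the
backing library hidden behind an implementation. This is only a sketch under
assumptions; "RegressionMatrix" and "ArrayRegressionMatrix" are hypothetical
names, and the plain-array backend is used solely to keep the example free of
external dependencies. An EJML-backed (or Commons-backed) class could
implement the same interface.

```java
/** Hypothetical facade: the only matrix type the regression code would depend on. */
interface RegressionMatrix {
    int rows();
    int columns();
    double get(int row, int column);
    RegressionMatrix transpose();
    RegressionMatrix multiply(RegressionMatrix other);
}

/** Plain-array backend, used here only to keep the sketch dependency-free. */
final class ArrayRegressionMatrix implements RegressionMatrix {
    private final double[][] data;

    ArrayRegressionMatrix(double[][] values) {
        if (values.length == 0 || values[0].length == 0) {
            throw new IllegalArgumentException("Matrix must not be empty");
        }
        data = new double[values.length][];
        for (int i = 0; i < values.length; i++) {
            if (values[i].length != values[0].length) {
                throw new IllegalArgumentException("Rows must have equal length");
            }
            data[i] = values[i].clone(); // defensive copy of each row
        }
    }

    @Override public int rows() { return data.length; }
    @Override public int columns() { return data[0].length; }
    @Override public double get(int row, int column) { return data[row][column]; }

    @Override
    public RegressionMatrix transpose() {
        double[][] t = new double[columns()][rows()];
        for (int i = 0; i < rows(); i++) {
            for (int j = 0; j < columns(); j++) {
                t[j][i] = data[i][j];
            }
        }
        return new ArrayRegressionMatrix(t);
    }

    @Override
    public RegressionMatrix multiply(RegressionMatrix other) {
        if (columns() != other.rows()) {
            throw new IllegalArgumentException("Dimension mismatch");
        }
        double[][] p = new double[rows()][other.columns()];
        for (int i = 0; i < rows(); i++) {
            for (int j = 0; j < other.columns(); j++) {
                for (int k = 0; k < columns(); k++) {
                    // Read the other operand through the interface only.
                    p[i][j] += data[i][k] * other.get(k, j);
                }
            }
        }
        return new ArrayRegressionMatrix(p);
    }
}
```

With something like this, an expression such as "x.transpose().multiply(x)"
is written once against the interface, and swapping the backend means adding
another implementation rather than touching the estimators. Whether a single
interface can cover decompositions and solvers as well is exactly the open
question.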

> Abstract classes should have interfaces above them or perhaps just be interfaces if a
> simpler approach is implemented (i.e. minimal OOP)
> Notes about this proposed implementation:
> AbstractVariables and its child classes may not be necessary, i.e. just Estimators and
> Residuals classes
> Or perhaps it’s best to follow the current implementation’s example and have a single
> class per regression type for hierarchy simplicity (but risking redundancies)?
> I have not looked into specific data members or individual methods yet. So far just taking
> notes from the current implementation and SuanShu
> The “statistics-regression-updating” components have quite complex algorithms which
> will require a lot of time for me to understand completely
> So for now, I see myself making minimal changes to them, prioritizing the new “stored”

IMHO, this will be better discussed once an initial implementation is shown
(or perhaps, as Eric suggested, with unit tests).

Again, better to start a new thread for each specific question, possibly backed
with a new JIRA report focussed on a particular task (see "Create sub-tasks"
on JIRA).

> RegressionDataLoader’s purpose is to:
> provide a clean input interface
> and to ensure that data from say double[][] is only converted to working form as a
> StatisticsMatrix object once

Until proven wrong, I'm a proponent of separating I/O from "useful" code.
I.e. I suggest that we consider on the one hand what API is required for all the
intended functionalities, and on the other (in a *different* "maven module"), all the
conversions that may be implemented for the convenience of users.
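
A rough sketch of what that separation could look like follows. All names
here ("RegressionData", "RegressionDataConverters", "fromArrays",
"fromCsvLines") are hypothetical and only illustrate the split: the interface
is what a core module would expose to estimators, while the converters class
is the kind of thing that would live in a separate convenience module. The
"response first, then regressors" line layout is likewise an assumption made
for the example.

```java
import java.util.List;

/** Core module (hypothetical): the only view of the data that estimators would need. */
interface RegressionData {
    int size();
    int regressorCount();
    double regressor(int observation, int index);
    double response(int observation);
}

/** Separate convenience module (hypothetical): conversions from user-facing formats. */
final class RegressionDataConverters {
    private RegressionDataConverters() {}

    /** Wraps the given arrays without copying them. */
    static RegressionData fromArrays(final double[][] x, final double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same number of rows");
        }
        return new RegressionData() {
            @Override public int size() { return y.length; }
            @Override public int regressorCount() { return x.length == 0 ? 0 : x[0].length; }
            @Override public double regressor(int observation, int index) {
                return x[observation][index];
            }
            @Override public double response(int observation) { return y[observation]; }
        };
    }

    /** Parses lines of the assumed form "y,x1,x2,...", one observation per line. */
    static RegressionData fromCsvLines(List<String> lines) {
        double[][] x = new double[lines.size()][];
        double[] y = new double[lines.size()];
        for (int i = 0; i < lines.size(); i++) {
            String[] fields = lines.get(i).split(",");
            if (fields.length < 2) {
                throw new IllegalArgumentException("Line " + i + " needs a response and a regressor");
            }
            y[i] = Double.parseDouble(fields[0].trim());
            x[i] = new double[fields.length - 1];
            for (int j = 1; j < fields.length; j++) {
                x[i][j - 1] = Double.parseDouble(fields[j].trim());
            }
        }
        return fromArrays(x, y);
    }
}
```

The point of the split is that estimators never see a file, a CSV line or a
double[][]; new input formats can then be added in the convenience module
without changing the core API.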

> while allowing multiple types of regression to be calculated via a universal form….
> which could become a challenge once details are in order.
> So this is the current state of my plan; with your input, I will move to the next steps,
> plan more details and start creating the software flowchart.
> Thank you in advance for any advice/suggestions,

To summarize, my main suggestion is to split this post into more
manageable chunks.


> -Ben Nguyen
