commons-dev mailing list archives

From Ben Nguyen <bennguye...@gmail.com>
Subject [GSoC][STATISTICS][Regression] Architecture Implementation Suggestions
Date Thu, 16 May 2019 08:02:05 GMT
Hello,

I have some broad ideas about how the regression module should be structured, as outlined
briefly (with UML diagrams) in my proposal.
This is the current implementation inside commons-math-stat-regression:




This is my proposed idea, whose structure was partly inspired by SuanShu, since it supports
multiple types of regression (including logistic):
https://github.com/aaiyer/SuanShu/tree/master/src/main/java/com/numericalmethod/suanshu/stats/regression/linear

Disclaimer: I have only studied some econometrics and second-year computer science at university,
so I have no professional data engineering experience, but I am excited to start learning
with this project. I do not yet know the exact needs of data engineers with regard to this
module and am learning as I go, which is why I would very much appreciate any input on the
requirements data engineers would have for this regression module.

From someone who has used the current implementation or will use this new implementation:
- What would make your life easier? 
- What should definitely be kept? 
- What should be added/improved?
- Any specific features or design criteria?
- Any changes or radically different approaches to the following idea?
Note: OLS, GLS and logistic regression will be implemented first, with a focus on providing
architectural support for further additions. Changes will make use of new Java 8 features,
specifically the Java Streams API, with the aim of improving performance and readability.
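To give a feel for the Streams style mentioned in the note, here is a minimal sketch of a single-regressor OLS fit using the closed-form slope and intercept. The class name SimpleOlsSketch and the method fit are hypothetical and are not part of the current implementation or of my proposal; whether streams actually perform better than plain loops here would still need to be measured.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical illustration only: one-regressor OLS written with Java 8 streams.
final class SimpleOlsSketch {

    private SimpleOlsSketch() { }

    /** Returns {intercept, slope} of the least-squares line through (x[i], y[i]). */
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("x and y need the same length, at least 2");
        }
        final int n = x.length;
        final double meanX = Arrays.stream(x).average().getAsDouble();
        final double meanY = Arrays.stream(y).average().getAsDouble();

        // Sum of cross-deviations and sum of squared deviations of x.
        final double sxy = IntStream.range(0, n)
                .mapToDouble(i -> (x[i] - meanX) * (y[i] - meanY))
                .sum();
        final double sxx = IntStream.range(0, n)
                .mapToDouble(i -> (x[i] - meanX) * (x[i] - meanX))
                .sum();
        if (sxx == 0.0) {
            throw new IllegalArgumentException("x values must not all be equal");
        }

        final double slope = sxy / sxx;
        final double intercept = meanY - slope * meanX;
        return new double[] {intercept, slope};
    }
}
```

For example, SimpleOlsSketch.fit(new double[] {1, 2, 3, 4}, new double[] {3, 5, 7, 9}) recovers the line y = 1 + 2x.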



Updates to the proposed implementation UML in my proposal:
- “statistics-regression-reqLinearMath” will be replaced with EJML, as suggested by Mr. Eric Barnhill
o This will include a custom matrix class, StatisticsMatrix, extended from EJML’s SimpleBase
o That way, if we decide to use an Apache Commons implementation of matrices later on, only this
class needs to be changed internally.
- Abstract classes should have interfaces above them, or perhaps just be interfaces if a simpler
approach is implemented (i.e. minimal OOP)
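The encapsulation idea behind StatisticsMatrix could look roughly like the sketch below. To keep it runnable without dependencies, it is backed by a plain double[][] as a stand-in for the EJML storage, so it does not extend SimpleBase and does not show EJML's API; the method names (of, transpose, multiply, get) are assumptions for illustration only. The point is that regression code would only ever see this class, so the backing library could be swapped behind it.

```java
// Hypothetical facade: callers depend on StatisticsMatrix only, never on the backing library.
// A plain double[][] stands in for the EJML-backed storage in this sketch.
final class StatisticsMatrix {

    private final double[][] data;

    private StatisticsMatrix(double[][] data) {
        this.data = data;
    }

    /** Copies the rows so later changes to the caller's array do not leak in. */
    static StatisticsMatrix of(double[][] rows) {
        if (rows.length == 0 || rows[0].length == 0) {
            throw new IllegalArgumentException("matrix must not be empty");
        }
        double[][] copy = new double[rows.length][];
        for (int r = 0; r < rows.length; r++) {
            if (rows[r].length != rows[0].length) {
                throw new IllegalArgumentException("all rows must have the same length");
            }
            copy[r] = rows[r].clone();
        }
        return new StatisticsMatrix(copy);
    }

    int numRows() { return data.length; }

    int numCols() { return data[0].length; }

    double get(int row, int col) { return data[row][col]; }

    StatisticsMatrix transpose() {
        double[][] t = new double[numCols()][numRows()];
        for (int r = 0; r < numRows(); r++) {
            for (int c = 0; c < numCols(); c++) {
                t[c][r] = data[r][c];
            }
        }
        return new StatisticsMatrix(t);
    }

    StatisticsMatrix multiply(StatisticsMatrix other) {
        if (numCols() != other.numRows()) {
            throw new IllegalArgumentException("inner dimensions do not match");
        }
        double[][] p = new double[numRows()][other.numCols()];
        for (int r = 0; r < numRows(); r++) {
            for (int c = 0; c < other.numCols(); c++) {
                double sum = 0.0;
                for (int k = 0; k < numCols(); k++) {
                    sum += data[r][k] * other.data[k][c];
                }
                p[r][c] = sum;
            }
        }
        return new StatisticsMatrix(p);
    }
}
```

With this shape, an expression such as x.transpose().multiply(x) in the regression code would stay the same whichever matrix library sits underneath.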
Notes about this proposed implementation:
- AbstractVariables and its child classes may not be necessary, i.e. just Estimators and
Residuals classes
- Or perhaps it’s best to follow the current implementation’s example and have a single
class per regression type for hierarchy simplicity (but risking redundancies)?
- I have not looked into specific data members or individual methods yet. So far I am just taking
notes from the current implementation and SuanShu
- The “statistics-regression-updating” components have quite complex algorithms, which
will take me a lot of time to understand completely
o So for now, I see myself making minimal changes to them and prioritizing the new “stored”
components.
- RegressionDataLoader’s purpose is to:
o provide a clean input interface
o ensure that data from, say, a double[][] is converted to its working form as a StatisticsMatrix
object only once
• while allowing multiple types of regression to be calculated via a universal form,
• which could become a challenge once the details are worked out.
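
The convert-once behaviour described for RegressionDataLoader could be sketched as follows. Everything here is an assumption for illustration: the method names (designMatrix, response, conversionCount), the choice of a double[][] with a prepended intercept column as the "working form" (the real one would be a StatisticsMatrix), and the lazy caching strategy.

```java
// Hypothetical sketch: the raw input is converted to its working form at most once,
// and every regression type asks the loader for that same cached object.
final class RegressionDataLoader {

    private final double[][] rawX;
    private final double[] y;
    private double[][] design;   // cached working form, built lazily
    private int conversions;     // only here to make the "once" property observable

    RegressionDataLoader(double[][] x, double[] y) {
        if (x.length == 0 || x.length != y.length) {
            throw new IllegalArgumentException("x and y must be non-empty with equal row counts");
        }
        this.rawX = x;
        this.y = y.clone();
    }

    /** Returns the design matrix with a leading column of ones, building it on first use. */
    synchronized double[][] designMatrix() {
        if (design == null) {
            int cols = rawX[0].length;
            double[][] built = new double[rawX.length][cols + 1];
            for (int r = 0; r < rawX.length; r++) {
                if (rawX[r].length != cols) {
                    throw new IllegalArgumentException("all rows must have the same length");
                }
                built[r][0] = 1.0; // intercept column (an assumption of this sketch)
                System.arraycopy(rawX[r], 0, built[r], 1, cols);
            }
            design = built;
            conversions++;
        }
        return design;
    }

    double[] response() {
        return y.clone();
    }

    synchronized int conversionCount() {
        return conversions;
    }
}
```

An OLS and a logistic implementation could then both be handed the same loader and call designMatrix(), with the conversion work done only for the first caller.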

This is the current state of my plan. With your input, I will move on to the next steps, plan
in more detail and start creating the software flowchart.

Thank you in advance for any advice/suggestions,
-Ben Nguyen
