commons-dev mailing list archives

From Ben Nguyen <>
Subject RE: [statistics] Pull request for GLSMultipleLinearRegression
Date Thu, 23 May 2019 14:25:32 GMT

The commons-math stat packages are currently being migrated to the new commons-statistics
library. I am working on the regression-related design for my Google Summer of Code project. I
am a new contributor and would love to work with people who have used these tools extensively
and can offer more insight.

The transition is mostly in the design stage. We are still working out essential questions,
such as which linear algebra library to use (not the one from commons-math, since it's outdated)
and how to design a better, more flexible UI.

I have not yet looked into GLS in as much depth as OLS or the new LogisticRegression component,
so perhaps you can help contribute to the GLS component to ensure your needs are met. Our goal
is also to maximize efficiency in all areas, using Java 8 features such as the Streams
API where they would improve performance.

Issue for regression component, please post insights here as well:
GitHub Repo:

Thank you for your post,
-Ben Nguyen

From: Елена Картышева
Sent: Thursday, May 23, 2019 8:44 AM
To: dev
Subject: [statistics] Pull request for GLSMultipleLinearRegression


I would like to propose a pull request that adds an option to use a variance vector instead
of a covariance matrix. When the errors are uncorrelated but heteroscedastic, this avoids
unnecessary memory usage and excessive computation, which makes it possible to work with huge
input matrices. Using a variance vector in such cases reduces time complexity from
O(N^2) to O(N), where N is the number of observations, and dramatically reduces memory usage.
For example, in my own work I needed to train a generalized linear model. The iteratively
reweighted least squares (IRLS) algorithm requires a weighted regression with more than a
million observations. The current implementation would require approximately 12 terabytes of
memory, while the patched version needs only 8 megabytes. Since IRLS is an iterative algorithm,
a million-fold reduction in complexity is also very useful.
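
To make the idea concrete, here is a minimal sketch of a weighted regression driven by a
variance vector. It is not the code from the pull request and not part of the commons-math or
commons-statistics API: the class name DiagonalGlsSketch and the method fit are hypothetical,
and for brevity it solves the normal equations directly, which is less numerically robust than
a QR-based solver. The point is only that with uncorrelated errors each observation contributes
a single weight 1/variance[i], so no N-by-N matrix is ever stored.

```java
/** Hypothetical illustration only; not part of commons-math or commons-statistics. */
final class DiagonalGlsSketch {

    private DiagonalGlsSketch() { }

    /**
     * Weighted least squares for uncorrelated, heteroscedastic errors.
     *
     * @param x        n-by-p design matrix (add a column of ones for an intercept)
     * @param y        n observed responses
     * @param variance n strictly positive error variances, one per observation
     * @return the p estimated coefficients
     */
    static double[] fit(double[][] x, double[] y, double[] variance) {
        int n = y.length;
        int p = x[0].length;

        // Accumulate the augmented normal equations [X'WX | X'Wy] in one pass.
        // Extra storage is p * (p + 1) doubles, independent of n.
        double[][] a = new double[p][p + 1];
        for (int i = 0; i < n; i++) {
            double w = 1.0 / variance[i]; // weight of observation i
            for (int j = 0; j < p; j++) {
                double wx = w * x[i][j];
                for (int k = 0; k < p; k++) {
                    a[j][k] += wx * x[i][k];
                }
                a[j][p] += wx * y[i];
            }
        }

        // Gaussian elimination with partial pivoting on the small p-by-p system.
        for (int col = 0; col < p; col++) {
            int pivot = col;
            for (int r = col + 1; r < p; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            if (a[col][col] == 0.0) {
                throw new IllegalArgumentException("Design matrix is rank deficient");
            }
            for (int r = col + 1; r < p; r++) {
                double factor = a[r][col] / a[col][col];
                for (int c = col; c <= p; c++) {
                    a[r][c] -= factor * a[col][c];
                }
            }
        }

        // Back substitution.
        double[] beta = new double[p];
        for (int r = p - 1; r >= 0; r--) {
            double sum = a[r][p];
            for (int c = r + 1; c < p; c++) {
                sum -= a[r][c] * beta[c];
            }
            beta[r] = sum / a[r][r];
        }
        return beta;
    }
}
```

In an IRLS loop, a routine of this shape would be called once per iteration with the variances
implied by the current fit, so the per-iteration cost in this sketch grows linearly with the
number of observations for a fixed number of regressors.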

Sincerely yours, Elena Kartysheva.
