Hi Phil,
thanks for reviewing the multiple linear regression implementations and
setting up the R/NIST data tests. I finally got around to installing R
and can now run them too.
Phil Steitz wrote:
> While clear and elegant from a matrix algebra standpoint, the "naive"
> implementation in OLSMultipleLinearRegression has bad numerical
> qualities. It is well known that solving the normal equations directly
> does not give good numerics. I just added some tests to actually verify
> parameter values, using the classic "Longley" dataset, for which NIST
> provides certified statistics. This is a "hard" design matrix. R was
> able to get to within 1E-8 of the certified parameter values.
> OLSMultipleLinearRegression can only get 1E-1.
The OLS implementation was added as a simple byproduct of the GLS
case - which is the main one I have needed for hypothesis testing - as
it came "for free" with an identity covariance matrix.
True - the emphasis was on clarity and formulaic simplicity, and also
on following the old Donald Knuth maxim "premature optimization is the
root of all evil". But it seems there is a need to refine the
implementation - the devil raised his head :)
> We have talked in the past about providing an implementation based on QR
> decomposition. Anyone up for using the QR decomposition that we now
> have to do this? I really think we need to do it (or something else to
> improve numerics) before releasing this class. I will get to it
> eventually, but am a little pegged at the moment. I will review and
> apply patches if someone is willing to do the implementation. I can
> also explain here or offline how the R tests and NIST datasets work, as
> these are useful in validating code.
I'd be happy to improve the impl. I'm getting my head around R and
NIST, but perhaps a chat offline would not hurt!
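
For concreteness, here is a minimal sketch of what a QR-based solve could
look like. It is not the Commons Math QR decomposition API: the class
QrLeastSquaresSketch and its solve method are hypothetical names made up
for illustration. It works on plain arrays with Householder reflections
and assumes a full-rank design matrix with at least as many rows as
columns.

```java
/** Sketch only: least squares via Householder QR (hypothetical helper, not the Commons Math API). */
final class QrLeastSquaresSketch {

    /**
     * Solves min ||y - X b||_2 for b, where X is n x p with n >= p (assumed full column rank).
     * The inputs are copied, so the caller's arrays are left untouched.
     */
    static double[] solve(double[][] x, double[] y) {
        int n = x.length;
        int p = x[0].length;
        double[][] a = new double[n][];
        for (int i = 0; i < n; i++) {
            a[i] = x[i].clone();
        }
        double[] rhs = y.clone();

        for (int k = 0; k < p; k++) {
            // Euclidean norm of column k from the diagonal down.
            double norm = 0.0;
            for (int i = k; i < n; i++) {
                norm = Math.hypot(norm, a[i][k]);
            }
            if (norm == 0.0) {
                // Only catches an exactly zero column; near-singular designs are not detected.
                throw new IllegalArgumentException("design matrix is rank deficient");
            }
            // Pick the sign opposite to a[k][k] to avoid cancellation in v[k].
            double alpha = a[k][k] > 0 ? -norm : norm;

            // Householder vector v = (column k) - alpha * e_k, nonzero only in rows k..n-1.
            double[] v = new double[n];
            double vtv = 0.0;
            for (int i = k; i < n; i++) {
                v[i] = a[i][k];
            }
            v[k] -= alpha;
            for (int i = k; i < n; i++) {
                vtv += v[i] * v[i];
            }

            // Apply H = I - 2 v v^T / (v^T v) to the remaining columns.
            for (int j = k; j < p; j++) {
                double dot = 0.0;
                for (int i = k; i < n; i++) {
                    dot += v[i] * a[i][j];
                }
                double f = 2.0 * dot / vtv;
                for (int i = k; i < n; i++) {
                    a[i][j] -= f * v[i];
                }
            }
            // Apply the same reflection to the right-hand side, building Q^T y.
            double dot = 0.0;
            for (int i = k; i < n; i++) {
                dot += v[i] * rhs[i];
            }
            double f = 2.0 * dot / vtv;
            for (int i = k; i < n; i++) {
                rhs[i] -= f * v[i];
            }
        }

        // Back substitution on the upper-triangular system R b = (Q^T y)[0..p-1].
        double[] b = new double[p];
        for (int i = p - 1; i >= 0; i--) {
            double s = rhs[i];
            for (int j = i + 1; j < p; j++) {
                s -= a[i][j] * b[j];
            }
            b[i] = s / a[i][i];
        }
        return b;
    }
}
```

Because X is factored directly and X'X is never formed, the condition
number of the problem is not squared, which is the usual argument for
preferring this route on ill-conditioned designs such as Longley.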
> Another thing that we should think about before releasing any of this
> stuff is the completeness of the API. Many standard regression
> statistics are missing. If we are going to stick with the Interface /
> Implementation setup, we need to get the right stuff into the
> interface. It is also awkward to have to insert "1"'s in the design
> matrix to get an intercept term computed. This is convenient for
> implementation, but awkward for users. A more natural setup (IMHO)
> would be to expose a "noIntercept" or "hasIntercept" property for the
> model.
No problem with adding other statistics - let's just decide on what
the standard regression API is.
And finally, how do you see the no/hasIntercept model working?
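
One possible reading, purely as a sketch to anchor the discussion: the
model keeps a boolean flag and adds the column of ones internally, so
users pass only their regressors. The class InterceptSketch and its
designMatrix method below are hypothetical names, not part of the
existing interface.

```java
/** Sketch only: building the design matrix from a hypothetical hasIntercept flag. */
final class InterceptSketch {

    /**
     * Returns a copy of the regressors, with a leading column of ones
     * prepended when hasIntercept is true. The input is never modified.
     */
    static double[][] designMatrix(double[][] regressors, boolean hasIntercept) {
        int n = regressors.length;
        int offset = hasIntercept ? 1 : 0;
        double[][] x = new double[n][];
        for (int i = 0; i < n; i++) {
            double[] row = new double[regressors[i].length + offset];
            if (hasIntercept) {
                row[0] = 1.0; // the intercept term multiplies this constant column
            }
            System.arraycopy(regressors[i], 0, row, offset, regressors[i].length);
            x[i] = row;
        }
        return x;
    }
}
```

With hasIntercept set to true, the first estimated parameter would be
the intercept; with false, the regression is forced through the origin.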
Cheers

