spark-dev mailing list archives

From Filipp Zhinkin <filipp.zhin...@gmail.com>
Subject [MLLib] Logistic Regression and standardization
Date Sun, 08 Apr 2018 20:09:28 GMT
Hi all,

While migrating from a custom LR implementation to MLLib's LR implementation,
my colleagues noticed that prediction quality dropped (according to
several business metrics).
It turned out that this issue is caused by the feature standardization performed
by MLLib's LR: regardless of the value of the 'standardization' option, all
features are scaled during loss and gradient computation (as well as in a few
other places):
https://github.com/apache/spark/blob/6cc7021a40b64c41a51f337ec4be9545a25e838c/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L229
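To make the behaviour described above concrete, here is a minimal sketch in plain Python of a per-example logistic loss and gradient in which every feature is divided by its standard deviation. The function name and arguments are hypothetical and chosen for illustration only; this is not MLLib code, just a sketch of the scaling being discussed, assuming binary labels in {0, 1} and no intercept.

```python
import math


def scaled_logistic_loss_and_grad(coefs, features, label, features_std):
    """Loss and gradient for one example, computed on features divided by their std.

    Hypothetical helper for illustration; it is not part of MLLib.
    """
    # Margin on scaled features x_j / std_j; zero-variance features are skipped.
    margin = sum(
        c * x / s
        for c, x, s in zip(coefs, features, features_std)
        if s != 0.0
    )
    prob = 1.0 / (1.0 + math.exp(-margin))

    # Log loss for a label in {0, 1}.
    loss = -math.log(prob) if label == 1.0 else -math.log(1.0 - prob)

    # The gradient is also taken with respect to the scaled features.
    multiplier = prob - label
    grad = [
        multiplier * x / s if s != 0.0 else 0.0
        for x, s in zip(features, features_std)
    ]
    return loss, grad
```

Note that in this sketch the division by `features_std` happens unconditionally: there is no flag that switches it off, which is the situation the message describes.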

According to comments in the code, standardization should be implemented
the same way it is implemented in R's glmnet package. I've looked through the
corresponding Fortran code, and it seems that glmnet doesn't scale features
when standardization is disabled (but MLLib still does).

Our models contain multiple one-hot encoded features, and scaling them is a
pretty bad idea.
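As a rough illustration of why this matters for one-hot features, the sketch below (plain Python, made-up data, hypothetical helper name) computes the sample standard deviation of two indicator columns and the value an active indicator takes after dividing by it. The use of the unbiased sample standard deviation is an assumption made for this example.

```python
import math


def sample_std(values):
    """Unbiased sample standard deviation of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


# Two one-hot indicator columns over 100 rows (made-up data):
# a common category active in 50 rows, a rare one active in a single row.
common = [1.0] * 50 + [0.0] * 50
rare = [1.0] * 1 + [0.0] * 99

common_std = sample_std(common)
rare_std = sample_std(rare)

# Value of an active indicator (1.0) after dividing by the column's std.
common_scaled = 1.0 / common_std
rare_scaled = 1.0 / rare_std
```

With this data the rare indicator is scaled to about 10 and the common one to about 2, so the rarer a category is, the larger its scaled value becomes, even though both columns started as plain 0/1 indicators.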

Why does MLLib's LR always scale all features? From my POV, it's a bug.

Thanks in advance,
Filipp.
