Hi Valeriy,
Let me make sure we are on the same page.
The statement "the current mllib implementation returns exactly the same
model whether standardization is turned on or off" should be corrected to
"the current mllib implementation returns exactly the same model whether
standardization is turned on or off, given that regularization is 0;
otherwise, the two models are expected to differ."
We expect
1. R glmnet and Spark ML share the same behavior, given all other
conditions are the same.
1.1 Following from 1, if the regularization parameter is not zero, Spark ML
outputs two different models depending on whether standardization is
turned on or off.
The easiest way to check 1.1 is to change setStandardization(false) to true
in a test with regularization != 0 and run the test again; it is expected
to fail.
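
To illustrate 1.1 outside Spark, here is a minimal pure-Python sketch. It is
not Spark or glmnet code: the toy data, the helper names (fit_logistic,
fit_standardized), the L2-only penalty, and the use of the population
standard deviation are all assumptions made for illustration. It fits a
one-feature logistic regression with Newton's method, once penalizing the
coefficient on the original scale and once penalizing it on the standardized
scale and mapping it back.

```python
import math


def fit_logistic(xs, ys, lam, iters=50):
    """Newton's method for one-feature logistic regression with intercept.

    Minimizes mean log-loss + (lam / 2) * w**2; the intercept b is not
    penalized. Hypothetical helper, not a Spark or glmnet API.
    """
    b, w = 0.0, 0.0
    n = len(xs)
    for _ in range(iters):
        g_b = g_w = h_bb = h_bw = h_ww = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b + w * x)))
            g_b += (p - y) / n
            g_w += (p - y) * x / n
            s = p * (1.0 - p) / n
            h_bb += s
            h_bw += s * x
            h_ww += s * x * x
        # L2 penalty contributes to the gradient and Hessian of w only.
        g_w += lam * w
        h_ww += lam
        # Solve the 2x2 Newton system explicitly.
        det = h_bb * h_ww - h_bw * h_bw
        b -= (h_ww * g_b - h_bw * g_w) / det
        w -= (h_bb * g_w - h_bw * g_b) / det
    return b, w


def fit_standardized(xs, ys, lam):
    """Fit on x / sigma, then return the coefficient on the original scale."""
    n = len(xs)
    mean = sum(xs) / n
    # Population standard deviation: an assumption of this sketch.
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    b, w_scaled = fit_logistic([x / sigma for x in xs], ys, lam)
    return b, w_scaled / sigma


# Toy, non-separable data (made up for this example).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

for lam in (0.0, 0.1):
    _, w_plain = fit_logistic(xs, ys, lam)
    _, w_std = fit_standardized(xs, ys, lam)
    print(f"lam={lam}: w_plain={w_plain:.6f}  w_standardized={w_std:.6f}")
```

With lam = 0 the two coefficients should agree up to rounding, because
rescaling a feature only reparameterizes the unpenalized problem. With
lam != 0 they should differ, because the penalty is applied to differently
scaled coefficients.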
On Fri, Apr 27, 2018 at 3:08 PM, Valeriy Avanesov <acopich@gmail.com> wrote:
> Hi all,
>
> maybe I'm missing something, but from what was discussed here I've
> gathered that the current mllib implementation returns exactly the same
> model whether standardization is turned on or off.
>
> I suggest considering an R script (please see below) that trains two
> penalized logistic regression models (with glmnet), with and without
> standardization. The models are clearly different.
>
> BTW, if penalization is turned off, the models are exactly the same.
>
> Therefore, the current mllib implementation doesn't follow glmnet. So,
> does that make it a bug?
>
> library(glmnet)
> library(e1071)
>
> set.seed(13)
>
> # generate synthetic data
> # (range assumed to be -500:500; 500:500 would yield a single row)
> X = cbind(-500:500, (-500:500)*1000)/100000
>
> y = sigmoid(X %*% c(1, 1))
> y = rbinom(y, 1, y)
>
> # define two testing points
> xTest = rbind(c(10, 10), c(20, 20))/1000
>
> # train two models: with and without standardization
> lambda = 0.01
>
> model = glmnet(X, y, family="binomial", standardize=TRUE, lambda=lambda)
> print(predict(model, xTest, type="link"))
>
> model = glmnet(X, y, family="binomial", standardize=FALSE, lambda=lambda)
> print(predict(model, xTest, type="link"))
>
> Best,
>
> Valeriy.
>
> On 04/25/2018 12:32 AM, DB Tsai wrote:
>
> As I’m one of the original authors, let me chime in with some comments.
>
> Without standardization, LBFGS will be unstable. For example, if a
> feature is multiplied by 10, then the corresponding coefficient should be
> divided by 10 to make the same prediction. But without standardization,
> LBFGS will converge to a different solution due to numerical instability.
>
> TL;DR: this can be implemented in the optimizer or in the trainer. We
> chose to implement it in the trainer because the LBFGS optimizer in breeze
> suffers from this issue. As a user, you don’t need to care much even if you
> have one-hot encoded features, and the result should match R.
>
> DB Tsai  Siri Open Source Technologies [not a contribution] 
> Apple, Inc
>
> On Apr 20, 2018, at 5:56 PM, Weichen Xu <weichen.xu@databricks.com> wrote:
>
> Right. If the regularization term isn't zero, then enabling or disabling
> standardization will give different results.
> But when comparing results between R glmnet and mllib, if we set the same
> parameters for regularization/standardization/..., then we should get
> the same result. If not, then maybe there's a bug. In that case you can
> paste your testing code and I can help fix it.
>
> On Sat, Apr 21, 2018 at 1:06 AM, Valeriy Avanesov <acopich@gmail.com>
> wrote:
>
>> Hi all.
>>
>> Filipp, do you use l1/l2/elasticnet penalization? I believe that in this
>> case standardization matters.
>>
>> Best,
>>
>> Valeriy.
>>
>> On 04/17/2018 11:40 AM, Weichen Xu wrote:
>>
>> Not a bug.
>>
>> When standardization is disabled, mllib LR will still standardize the
>> features, but it will scale the coefficients back at the end (after
>> training has finished). So it will get the same result as training without
>> standardization. The purpose is to improve the rate of convergence. So
>> the result should always be exactly the same as R's glmnet, whether
>> standardization is enabled or disabled.
>>
>> Thanks!
>>
>> On Sat, Apr 14, 2018 at 2:21 AM, Yanbo Liang <ybliang8@gmail.com> wrote:
>>
>>> Hi Filipp,
>>>
>>> MLlib’s LR implementation handles standardization the same way as R’s
>>> glmnet.
>>> Actually you don’t need to care about the implementation detail, as the
>>> coefficients are always returned on the original scale, so it should
>>> return the same result as other popular ML libraries.
>>> Could you point me to where glmnet doesn’t scale features?
>>> I suspect other issues caused your prediction quality to drop. If you can
>>> share the code and data, I can help check it.
>>>
>>> Thanks
>>> Yanbo
>>>
>>>
>>> On Apr 8, 2018, at 1:09 PM, Filipp Zhinkin <filipp.zhinkin@gmail.com>
>>> wrote:
>>>
>>> Hi all,
>>>
>>> While migrating from a custom LR implementation to MLLib's LR
>>> implementation, my colleagues noticed that prediction quality dropped
>>> (according to different business metrics).
>>> It turned out that this issue is caused by the feature standardization
>>> performed by MLLib's LR: regardless of the 'standardization' option's
>>> value, all features are scaled during loss and gradient computation (as
>>> well as in a few other places):
>>> https://github.com/apache/spark/blob/6cc7021a40b64c41a51f337ec4be9545a25e838c/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L229
>>>
>>> According to comments in the code, standardization should be implemented
>>> the same way it is implemented in R's glmnet package. I've looked through
>>> the corresponding Fortran code, and it seems that glmnet doesn't scale
>>> features when you disable standardization (but MLLib still does).
>>>
>>> Our models contain multiple one-hot encoded features, and scaling them
>>> is a pretty bad idea.
>>>
>>> Why does MLLib's LR always scale all features? From my POV it's a bug.
>>>
>>> Thanks in advance,
>>> Filipp.
>>>
>>>
>>>
>>
>>
>
>
>
