spark-dev mailing list archives

From Weichen Xu <weichen...@databricks.com>
Subject Re: [MLLib] Logistic Regression and standardization
Date Sat, 21 Apr 2018 00:56:41 GMT
Right. If the regularization term isn't zero, then enabling or disabling
standardization will give different results.
But when comparing results between R's glmnet and MLlib, if we set the same
parameters for regularization, standardization, etc., we should get the
same result. If not, there may be a bug. In that case, you can paste
your test code and I can help fix it.
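
To make the point about regularization concrete, here is a minimal sketch. It is not MLlib's or glmnet's actual code, just a toy under stated assumptions: a single feature, an L2 penalty on the slope only (the intercept is unpenalized), features scaled by their standard deviation without centering, and a plain Newton solver. The helper names (fit_logistic_1d, population_std, slope_raw_vs_standardized) and the data are hypothetical. It fits on raw features and on scaled features, then maps the scaled slope back to the original scale so the two can be compared.

```python
import math


def fit_logistic_1d(xs, ys, lam, iters=50):
    """Newton's method for one-feature logistic regression.

    Minimizes mean log-loss + (lam / 2) * b1**2.
    The intercept b0 is not penalized (an assumption of this sketch).
    """
    n = len(xs)
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += (p - y) / n        # gradient w.r.t. intercept
            g1 += (p - y) * x / n    # gradient w.r.t. slope
            h00 += w / n             # Hessian entries
            h01 += w * x / n
            h11 += w * x * x / n
        g1 += lam * b1               # L2 penalty on the slope only
        h11 += lam
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * d = g, then subtract d.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 -= d0
        b1 -= d1
    return b0, b1


def population_std(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def slope_raw_vs_standardized(xs, ys, lam):
    """Return (slope fitted on raw x, slope fitted on x / std mapped back)."""
    sigma = population_std(xs)
    _, raw = fit_logistic_1d(xs, ys, lam)
    _, scaled = fit_logistic_1d([x / sigma for x in xs], ys, lam)
    return raw, scaled / sigma  # divide by std to return to the original scale


# Hypothetical, non-separable toy data.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

for lam in (0.0, 0.5):
    raw, back = slope_raw_vs_standardized(xs, ys, lam)
    print(f"lam={lam}: raw={raw:.6f}  scaled-then-rescaled={back:.6f}")
```

With lam = 0 the two slopes should agree up to floating-point error, because rescaling a feature only reparameterizes the unpenalized problem. With lam > 0 they should differ, because penalizing the slope on the scaled feature is a different penalty from penalizing it on the original scale.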

On Sat, Apr 21, 2018 at 1:06 AM, Valeriy Avanesov <acopich@gmail.com> wrote:

> Hi all.
>
> Filipp, do you use L1/L2/elastic-net penalization? I believe standardization
> matters in that case.
>
> Best,
>
> Valeriy.
>
> On 04/17/2018 11:40 AM, Weichen Xu wrote:
>
> Not a bug.
>
> When standardization is disabled, MLlib LR still standardizes the
> features, but it scales the coefficients back at the end (after
> training has finished), so it gets the same result as training without
> standardization. The purpose is to improve the rate of convergence. So the
> result should always be exactly the same as R's glmnet, whether
> standardization is enabled or disabled.
>
> Thanks!
>
> On Sat, Apr 14, 2018 at 2:21 AM, Yanbo Liang <ybliang8@gmail.com> wrote:
>
>> Hi Filipp,
>>
>> MLlib’s LR implementation handles standardization the same way as R’s
>> glmnet.
>> You don’t actually need to care about this implementation detail: the
>> coefficients are always returned on the original scale, so it should
>> return the same result as other popular ML libraries.
>> Could you point me to where glmnet doesn’t scale features?
>> I suspect other issues caused your prediction quality to drop. If you can
>> share the code and data, I can help check it.
>>
>> Thanks
>> Yanbo
>>
>>
>> On Apr 8, 2018, at 1:09 PM, Filipp Zhinkin <filipp.zhinkin@gmail.com>
>> wrote:
>>
>> Hi all,
>>
>> While migrating from a custom LR implementation to MLlib's LR
>> implementation, my colleagues noticed that prediction quality dropped
>> (according to several business metrics).
>> It turned out that this issue is caused by the feature standardization
>> performed by MLlib's LR: regardless of the 'standardization' option's
>> value, all features are scaled during loss and gradient computation (as
>> well as in a few other places):
>> https://github.com/apache/spark/blob/6cc7021a40b64c41a51f337ec4be9545a25e838c/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L229
>>
>> According to comments in the code, standardization should be implemented
>> the same way it is implemented in R's glmnet package. I've looked through
>> the corresponding Fortran code, and it seems that glmnet doesn't scale
>> features when you disable standardization (but MLlib still does).
>>
>> Our models contain multiple one-hot encoded features, and scaling them is
>> a pretty bad idea.
>>
>> Why does MLlib's LR always scale all features? From my POV it's a bug.
>>
>> Thanks in advance,
>> Filipp.
>>
>>
>>
>
>
