spark-dev mailing list archives

From Valeriy Avanesov <acop...@gmail.com>
Subject Re: [MLLib] Logistic Regression and standardization
Date Fri, 20 Apr 2018 17:06:52 GMT
Hi all.

Filipp, do you use l1/l2/elastic-net penalization? I believe standardization
matters in that case.
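
A minimal sketch of this with the DataFrame-based API (toy data and an
arbitrary regParam, just for illustration): once regParam > 0, the penalty is
applied on whatever scale the optimizer sees, so the two fits below are not
guaranteed to give the same coefficients.

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[1]").appName("std-demo").getOrCreate()
    import spark.implicits._

    // Toy data: the second feature deliberately lives on a much larger scale.
    val training = Seq(
      (0.0, Vectors.dense(0.1, 100.0)),
      (1.0, Vectors.dense(0.9, 900.0)),
      (0.0, Vectors.dense(0.2, 150.0)),
      (1.0, Vectors.dense(0.8, 800.0))
    ).toDF("label", "features")

    def fitLR(standardize: Boolean) =
      new LogisticRegression()
        .setRegParam(0.1)          // arbitrary penalty strength
        .setElasticNetParam(1.0)   // pure L1
        .setStandardization(standardize)
        .fit(training)

    // With a non-zero penalty these coefficient vectors can differ,
    // because the L1 term is applied on different feature scales.
    println(fitLR(standardize = true).coefficients)
    println(fitLR(standardize = false).coefficients)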

Best,

Valeriy.


On 04/17/2018 11:40 AM, Weichen Xu wrote:
> Not a bug.
>
> When disabling standardization, MLlib LR will still standardize the
> features internally, but it will scale the coefficients back at the end
> (after training has finished). So it will produce the same result as
> training without standardization. The purpose of this is to improve the
> rate of convergence. So the result should always be exactly the same as
> R's glmnet, whether standardization is enabled or disabled.
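
A minimal sketch of that back-scaling step (made-up numbers, not the actual
MLlib code): if the optimizer worked on x_j / std_j, dividing each fitted
coefficient by std_j gives the model on the original feature scale.

    // Made-up per-feature standard deviations and coefficients found on the
    // scaled features; this only illustrates the mapping back to the
    // original scale, it is not the MLlib implementation itself.
    val featuresStd = Array(2.0, 50.0)
    val coefficientsStd = Array(0.8, 1.2)

    val coefficientsOriginal = coefficientsStd.zip(featuresStd).map {
      case (beta, std) => if (std != 0.0) beta / std else 0.0
    }
    // coefficientsOriginal: Array(0.4, 0.024)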
>
> Thanks!
>
> On Sat, Apr 14, 2018 at 2:21 AM, Yanbo Liang <ybliang8@gmail.com> wrote:
>
>     Hi Filipp,
>
>     MLlib’s LR implementation handles standardization the same way as
>     R’s glmnet.
>     Actually you don’t need to care about the implementation detail,
>     as the coefficients are always returned on the original scale, so
>     it should return the same results as other popular ML libraries.
>     Could you point me to where glmnet doesn’t scale features?
>     I suspect other issues caused the drop in prediction quality. If
>     you can share the code and data, I can help check it.
>
>     Thanks
>     Yanbo
>
>
>>     On Apr 8, 2018, at 1:09 PM, Filipp Zhinkin
>>     <filipp.zhinkin@gmail.com> wrote:
>>
>>     Hi all,
>>
>>     While migrating from a custom LR implementation to MLlib's LR
>>     implementation, my colleagues noticed that prediction quality
>>     dropped (according to different business metrics).
>>     It turned out that this issue is caused by the feature
>>     standardization performed by MLlib's LR: regardless of the
>>     'standardization' option's value, all features are scaled during
>>     loss and gradient computation (as well as in a few other places):
>>     https://github.com/apache/spark/blob/6cc7021a40b64c41a51f337ec4be9545a25e838c/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L229
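
Paraphrasing the linked aggregator code (a sketch, not the exact
implementation): every active feature value is divided by its column's
standard deviation while the margin and gradient are accumulated, and this
happens regardless of the 'standardization' parameter.

    // Sketch of the scaling inside the binary margin computation:
    def margin(coefficients: Array[Double], features: Array[Double],
               featuresStd: Array[Double], intercept: Double): Double = {
      var sum = intercept
      var i = 0
      while (i < features.length) {
        if (featuresStd(i) != 0.0 && features(i) != 0.0) {
          // the feature is scaled by 1 / std unconditionally
          sum += coefficients(i) * (features(i) / featuresStd(i))
        }
        i += 1
      }
      sum
    }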
>>
>>     According to comments in the code, standardization should be
>>     implemented the same way as in R's glmnet package. I've looked
>>     through the corresponding Fortran code, and it seems like glmnet
>>     doesn't scale features when standardization is disabled (but
>>     MLlib still does).
>>
>>     Our models contain multiple one-hot encoded features, and scaling
>>     them is a pretty bad idea.
>>
>>     Why does MLlib's LR always scale all features? From my POV it's a bug.
>>
>>     Thanks in advance,
>>     Filipp.
>>
>
>

