spark-dev mailing list archives

From Gang Bai <baig...@staff.sina.com.cn>
Subject Re: Contributing to MLlib on GLM
Date Tue, 01 Jul 2014 03:17:42 GMT
Thanks Xiaokai,

I’ve created a pull request to merge the features from my PR into your repo. Please review it
here: https://github.com/xwei-datageek/spark/pull/2 .

As for GLMs, here at Sina we are working on predicting the number of visitors who
read a particular news article or watch an online sports live stream in a given period.
I’m trying to improve the predictions by tuning features and incorporating new models,
so I’ll try Gamma regression later. Thanks for the implementation.

Cheers,
-Gang

On Jun 29, 2014, at 8:17 AM, xwei <weixiaokai@gmail.com> wrote:

> Hi Gang,
> 
> No worries! 
> 
> I agree LBFGS would converge faster and your test suite is more comprehensive. I'd like
to merge my branch with yours.
> 
> I also agree with your viewpoint on the redundancy issue. For different GLMs, usually
they only differ in gradient calculation but the ****regression.scala files are quite similar.
For example, linearRegressionSGD, logisticRegressionSGD, RidgeRegressionSGD, poissonRegressionSGD
all share quite a bit of common code in their class implementations. Since such redundancy
is already there in the legacy code, simply merging Poisson and Gamma does not seem to help
much. So I suggest we just leave them as separate classes for the time being. 
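
[Illustration of the point above, that GLMs mostly differ in their gradient. This is a minimal Python sketch, not the MLlib code: the function names are hypothetical, both Poisson and Gamma are assumed to use a log link (Gamma with unit shape), and plain full-batch gradient descent stands in for the SGD and LBFGS optimizers discussed in the thread.]

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# A per-example gradient: (weights, features, label) -> gradient vector.
Gradient = Callable[[Vector, Vector, float], List[float]]


def dot(w: Vector, x: Vector) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def least_squares_gradient(w: Vector, x: Vector, y: float) -> List[float]:
    # Squared loss: 0.5 * (w.x - y)^2
    residual = dot(w, x) - y
    return [residual * xi for xi in x]


def poisson_gradient(w: Vector, x: Vector, y: float) -> List[float]:
    # Poisson negative log-likelihood, log link: exp(w.x) - y * (w.x)
    residual = math.exp(dot(w, x)) - y
    return [residual * xi for xi in x]


def gamma_gradient(w: Vector, x: Vector, y: float) -> List[float]:
    # Gamma negative log-likelihood, log link, unit shape (assumed):
    # (w.x) + y * exp(-w.x)
    residual = 1.0 - y * math.exp(-dot(w, x))
    return [residual * xi for xi in x]


def fit(gradient: Gradient, data: List[Tuple[Vector, float]], dim: int,
        step: float = 0.05, iterations: int = 10000) -> List[float]:
    # The training loop is shared; only the gradient function changes.
    w = [0.0] * dim
    n = len(data)
    for _ in range(iterations):
        total = [0.0] * dim
        for x, y in data:
            g = gradient(w, x, y)
            total = [t + gi for t, gi in zip(total, g)]
        w = [wi - step * t / n for wi, t in zip(w, total)]
    return w


# Noise-free toy data: y = exp(0.5 + 0.3 * x1); the first feature is a bias term.
data = [([1.0, float(x1)], math.exp(0.5 + 0.3 * x1)) for x1 in range(4)]
poisson_w = fit(poisson_gradient, data, dim=2)
gamma_w = fit(gamma_gradient, data, dim=2)
```

[Because the toy data has no noise, both fits should land close to the generating weights (0.5, 0.3); the only thing that differs between the two calls is the gradient function passed in.]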
> 
> 
> Best regards,
> 
> Xiaokai
> 
> On Jun 27, 2014, at 6:45 PM, Gang Bai [via Apache Spark Developers List] wrote:
> 
>> Hi Xiaokai, 
>> 
>> My bad. I didn't notice this before I created another PR for Poisson regression.
The emails were buried in the junk folder by our corporate mail filter. Also, thanks for
considering my comments and advice in your PR. 
>> 
>> Adding my two cents here: 
>> 
>> * PoissonRegressionModel and GammaRegressionModel have the same fields and prediction
method. Shall we use one class instead of two redundant ones? Say, a LogLinearModel. 
>> * The LBFGS optimizer takes fewer iterations and results in better convergence than
SGD. I implemented two GeneralizedLinearAlgorithm classes, using LBFGS and SGD respectively.
You may take a look at them. If it's OK with you, I'd be happy to send a PR to your branch.
>> * In addition to the generated test data, we may use some real-world data for testing.
In my implementation, I added the test data from https://onlinecourses.science.psu.edu/stat504/node/223.
Please check my test suite. 
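
[Illustration of the first bullet above: a minimal Python sketch of a single shared model class. The name LogLinearModel comes from the email, but the fields and the predict signature here are assumptions for illustration, not the classes in either PR. It assumes both Poisson and Gamma regression predict through a log link, i.e. exp(weights . features + intercept).]

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LogLinearModel:
    """One model class that either a Poisson or a Gamma trainer could return."""
    weights: Sequence[float]
    intercept: float

    def predict(self, features: Sequence[float]) -> float:
        # Log link: the linear margin is mapped to a positive mean via exp.
        margin = sum(w * x for w, x in zip(self.weights, features)) + self.intercept
        return math.exp(margin)


model = LogLinearModel(weights=[0.3], intercept=0.5)
prediction = model.predict([2.0])  # exp(0.5 + 0.3 * 2.0)
```

[Under this design the two algorithms would differ only in how the weights are fitted, while the returned model type and its prediction code are shared.]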
>> 
>> -Gang 
>> Sent from my iPad 
>> 
>>> On June 27, 2014, at 6:03 PM, "xwei" <[hidden email]> wrote: 
>>> 
>>> 
>>> Yes, that's what we did: adding two gradient functions to Gradient.scala and
>>> creating PoissonRegression and GammaRegression using these gradients. We made 
>>> a PR on this. 
>>> 
>>> 
>>> 
>>> -- 
>>> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-on-GLM-tp7033p7088.html
>>> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

>> 
>> 
> 

