spark-issues mailing list archives

From "mike bowles (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-1673) GLMNET implementation in Spark
Date Thu, 26 Feb 2015 18:49:05 GMT

    [ https://issues.apache.org/jira/browse/SPARK-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338897#comment-14338897 ]

mike bowles commented on SPARK-1673:
------------------------------------

Some colleagues and I have a Spark version of glmnet working and have started some discussion.
Joseph Bradley suggested that we copy the discussion here in order to keep track of it.
Here's the discussion thread in the usual last-first order. Besides myself, the thread involves
Joseph and Debasish Das, who is working on the OWLQN implementation.

On Wed, Feb 25, 2015 at 10:35 AM, <mike@mbowles.com> wrote:

Hi Debasish,
Any method that generates point solutions to the minimization problem can simply be run
a number of times to generate the coefficient path as a function of the penalty parameter.
I think the only issues are how easy the method is to use and how much training and developer
time it takes to produce an answer.
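
To be concrete, the recipe is just a warm-started sweep over a decreasing penalty grid.
Here's a minimal sketch in Scala; the point-solver signature is hypothetical, not an
existing API:

    // Turn any single-point elastic-net solver into a path solver by warm-starting
    // each fit at the previous lambda's solution.
    def coefficientPath(
        x: Array[Array[Double]],   // design matrix, one row per example
        y: Array[Double],          // targets
        lambdas: Seq[Double],      // penalty values, largest first
        solve: (Array[Array[Double]], Array[Double], Double, Array[Double]) => Array[Double]
    ): Seq[(Double, Array[Double])] = {
      var warmStart = Array.fill(x.head.length)(0.0)
      lambdas.map { lambda =>
        warmStart = solve(x, y, lambda, warmStart)  // warm-started point solve
        (lambda, warmStart)
      }
    }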

With regard to training time, Friedman says in his paper that they found problems where glmnet
would generate the entire coefficient path more rapidly than sophisticated single-point methods
would generate single point solutions - not all problems, but some. Ryan Tibshirani (Robert's
son), a professor at CMU who researches convex optimization, has echoed that assertion for the
particular case of the elastic-net penalty (that's from slides of his that are available
online). So there's an open question about training speed that I believe we can answer in
fairly short order. I'm eager to explore that. Does OWLQN do a pass through the data on each
iteration? The linear version of GLMNET does not. On the other hand, OWLQN may be able to take
coarser steps through parameter space.

With regard to developer time, glmnet doesn't require the user to supply a starting point
for the penalty parameter. It calculates the starting point itself, which makes it completely
automatic. You've probably been through the process of manually searching regularization
parameter space with SVM: pick a set of regularization values, say 10^-2 through 10^5 in
factor-of-ten steps, see whether there's a minimum in the range, and if not shift the range
right or left. One of the reasons I pick up glmnet first for a new problem is that you just
drop in the training set and out pop the coefficient curves. Usually the defaults work. One
time out of 50 (or so) it doesn't converge; it alerts you that it didn't converge, you change
one parameter, and you rerun. If you also drop in a test set, it even picks the optimum
solution and produces an estimate of out-of-sample error.
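
For reference, the automatic starting point is simple to state. Following the
Friedman/Hastie/Tibshirani paper, the largest penalty worth considering is the one at which
every coefficient is exactly zero, and the grid descends geometrically from there. A minimal
sketch, assuming standardized predictors and a centered response:

    def lambdaGrid(
        cols: Array[Array[Double]],  // predictors, column-major, standardized
        y: Array[Double],            // centered response
        alpha: Double = 1.0,         // elastic-net mixing parameter; must be > 0 here
        nLambda: Int = 100,
        epsilon: Double = 1e-3): Seq[Double] = {
      val n = y.length
      // smallest penalty at which every coefficient is exactly zero
      val lambdaMax = cols.map { col =>
        math.abs(col.zip(y).map { case (xj, yi) => xj * yi }.sum) / (n * alpha)
      }.max
      // geometric grid from lambdaMax down to epsilon * lambdaMax
      val step = math.log(epsilon) / (nLambda - 1)
      (0 until nLambda).map(k => lambdaMax * math.exp(k * step))
    }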

We're going to make some speed/scaling runs on the synthetic data sets (in a range of sizes)
that are used in Spark for testing linear regression. We need some wider data sets; Joseph
mentioned some that we'll look at. I've got a gene expression data set that's 30k wide by
15k tall. That takes a few hours to train using the R version of glmnet. We're also talking
to some biology friends to find other interesting data sets.

I really am eager to see the comparisons, and happy to help you tailor OWLQN to generate
coefficient paths. We might be able to produce a hybrid of Friedman's algorithm, using his
basic algorithm outline but substituting OWLQN for his round-robin coordinate descent. But
I'm a little concerned that it's the round-robin coordinate descent that makes it possible
to skip passing through the full data set for 4 out of 5 iterations. We might be able to
work a way around that.
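
For anyone following along, here's a rough sketch of that inner loop as I read it in the
paper (not our actual implementation): each coordinate gets a closed-form soft-thresholded
update against the current residual, and after one full sweep you can cycle over just the
active (nonzero) coordinates, which is where the saved data passes come from.

    // Soft-thresholding operator S(z, gamma) from the paper.
    def softThreshold(z: Double, gamma: Double): Double =
      math.signum(z) * math.max(math.abs(z) - gamma, 0.0)

    // One round-robin sweep over the given coordinates. Predictors are assumed
    // standardized; `residual` holds r = y - X*beta and is updated in place, so
    // later sweeps over the small active set never touch the full data.
    def coordinateSweep(
        cols: Array[Array[Double]],  // cols(j) is standardized predictor j
        residual: Array[Double],
        beta: Array[Double],
        lambda: Double,
        alpha: Double,
        active: Seq[Int]): Unit = {
      val n = residual.length
      for (j <- active) {
        val old = beta(j)
        // partial-residual correlation for coordinate j
        val z = cols(j).zip(residual).map { case (xj, ri) => xj * ri }.sum / n + old
        beta(j) = softThreshold(z, lambda * alpha) / (1.0 + lambda * (1.0 - alpha))
        val delta = beta(j) - old
        if (delta != 0.0) {
          var i = 0
          while (i < n) { residual(i) -= cols(j)(i) * delta; i += 1 }
        }
      }
    }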

I'm just eager to have parallel versions of the tools available.  I'll keep you posted on
our results.  We should aim for running one another's code.  I'll check with my colleagues
and see when we'll have something we can hand out.  We've delayed putting together a release
version in favor of generating some scaling results, as Joseph suggested.  Discussions like
this may have some impact on what the release code looks like. 
Mike

From: Debasish Das [mailto:debasish.das83@gmail.com]
Sent: Wednesday, February 25, 2015 08:50 AM
To: 'Joseph Bradley'
Cc: mike@mbowles.com, 'dev'
Subject: Re: Have Friedman's glmnet algo running in Spark

Any reason why the regularization path cannot be implemented using the current OWLQN PR?

We can change owlqn in breeze to fit your needs...
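
Something like the following, say - a warm-started sweep of breeze's OWLQN over a decreasing
L1 grid. The constructor arguments shown (max iterations, L-BFGS memory, per-coordinate L1
penalty) follow one version of the breeze API; treat the exact signature as an assumption:

    import breeze.linalg.DenseVector
    import breeze.optimize.{DiffFunction, OWLQN}

    def owlqnPath(
        smoothLoss: DiffFunction[DenseVector[Double]],  // differentiable part of the objective
        dim: Int,
        lambdas: Seq[Double]                            // L1 penalties, largest first
    ): Seq[(Double, DenseVector[Double])] = {
      var warm = DenseVector.zeros[Double](dim)
      lambdas.map { lambda =>
        val solver = new OWLQN[Int, DenseVector[Double]](100, 10, (_: Int) => lambda)
        warm = solver.minimize(smoothLoss, warm)        // warm start from previous solution
        (lambda, warm)
      }
    }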

From: Joseph Bradley [mailto:joseph@databricks.com]
Sent: Sunday, February 22, 2015 06:48 PM
To: mike@mbowles.com
Cc: dev@spark.apache.org
Subject: Re: Have Friedman's glmnet algo running in Spark

Hi Mike,

glmnet has definitely been very successful, and it would be great to see how we can improve
optimization in MLlib! There is some related work ongoing; here are the JIRAs:

- GLMNET implementation in Spark
- LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package

The GLMNET JIRA has actually been closed in favor of the latter one. However, if you're
getting good results in your experiments, could you please post them on the GLMNET JIRA and
link them from the other JIRA? If it's faster and more scalable, that would be great to find
out. As far as where the code should go and the APIs, that can be discussed on the JIRA.

I hope this helps, and I'll keep an eye out for updates on the JIRAs!
Joseph

> GLMNET implementation in Spark
> ------------------------------
>
>                 Key: SPARK-1673
>                 URL: https://issues.apache.org/jira/browse/SPARK-1673
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Sung Chung
>
> This is a Spark implementation of GLMNET by Jerome Friedman, Trevor Hastie, Rob Tibshirani.
> http://www.jstatsoft.org/v33/i01/paper
> It's a straightforward implementation of coordinate-descent based L1/L2 regularized
> linear models, including linear/logistic/multinomial regressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
