spark-dev mailing list archives

From m...@mbowles.com
Subject Re: Have Friedman's glmnet algo running in Spark
Date Wed, 25 Feb 2015 18:35:14 GMT
 Hi Debasish, 
Any method that generates point solutions to the minimization problem could simply be run
a number of times to generate the coefficient paths as a function of the penalty parameter.
I think the only issues are how easy the method is to use and how much training and developer
time is required to produce an answer. 

With regard to training time, Friedman says in his paper that they found problems where glmnet
would generate the entire coefficient path more rapidly than sophisticated single point methods
would generate single point solutions - not all problems, but some problems. Ryan Tibshirani
(Robert's son) who's a professor and researcher at CMU in convex function optimization has
echoed that assertion for the particular case of the elasticnet penalty function (that's from
slides of his that are available online). So there's an open question about the training speed
that I believe we can answer in fairly short order. I'm eager to explore that. Does OWLQN
do a pass through the data for each iteration? The linear version of GLMNET does not. On the
other hand, OWLQN may be able to take coarser steps through parameter space. 
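
To make the data-pass point concrete, here is a minimal sketch in plain Python (not the Spark code discussed in this thread) of coordinate descent with covariance updates for the elastic net. The names gram_stats, cd_covariance and coefficient_path are hypothetical. As a simplifying assumption it caches the whole Gram matrix in one pass; the glmnet paper instead computes inner products only for features as they enter the active set, which is why a data pass is needed when the active set changes.

```python
def soft_threshold(z, g):
    """Soft-thresholding operator S(z, g) used by the lasso/elastic net update."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def gram_stats(X, y):
    """One pass over the data: cache X'y/n and X'X/n (assumption: p is small)."""
    n, p = len(y), len(X[0])
    xty = [sum(X[i][j] * y[i] for i in range(n)) / n for j in range(p)]
    xtx = [[sum(X[i][j] * X[i][k] for i in range(n)) / n for k in range(p)]
           for j in range(p)]
    return xty, xtx


def cd_covariance(xty, xtx, lam, alpha=1.0, beta=None, tol=1e-10, max_iter=1000):
    """Round-robin coordinate descent that touches only the cached statistics."""
    p = len(xty)
    beta = list(beta) if beta is not None else [0.0] * p
    for _ in range(max_iter):
        max_delta = 0.0
        for j in range(p):
            denom = xtx[j][j] + lam * (1.0 - alpha)
            if denom == 0.0:
                continue  # all-zero column with pure L1 penalty: leave at 0
            # Correlation of feature j with the partial residual, from the cache.
            rho = xty[j] - sum(xtx[j][k] * beta[k] for k in range(p) if k != j)
            new = soft_threshold(rho, lam * alpha) / denom
            max_delta = max(max_delta, abs(new - beta[j]))
            beta[j] = new
        if max_delta < tol:
            break
    return beta


def coefficient_path(xty, xtx, lambdas, alpha=1.0):
    """Solve for each penalty in turn, warm-starting from the previous solution."""
    beta, path = None, []
    for lam in lambdas:
        beta = cd_covariance(xty, xtx, lam, alpha, beta)
        path.append(beta)
    return path
```

Once gram_stats has run, every call to cd_covariance works from the cached inner products alone, so the inner iterations and the warm-started path cost no further passes through the rows.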

With regard to developer time, glmnet doesn't require the user to supply a starting point
for the penalty parameter. It calculates the starting point. That makes it completely automatic.
You've probably been through the process of manually searching regularization parameter space
with SVM. Pick out a set of regularization parameter values, like 10 raised to the powers -2
through +5 in steps of 1. See if there's a minimum in the range and if not shift to the right or
left. One of the reasons I pick up glmnet first for a new problem is that you just drop in
the training set and out pop the coefficient curves. Usually the defaults work. One time out
of 50 (or so) it doesn't converge. It alerts you that it didn't converge and you change one
parameter and rerun. If you also drop in a test set then it even picks the optimum solution
and produces an estimate of out-of-sample error. 
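
As a rough illustration of how the starting point can be calculated, here is a minimal Python sketch of the penalty grid. It assumes the columns of X are standardized and y is centered, in which case the smallest penalty that zeroes every coefficient is max_j |<x_j, y>| / (n * alpha). The function name lambda_grid is hypothetical, and the defaults (100 values, ratio 0.001) are the typical values quoted in the glmnet paper, not necessarily what any given implementation uses.

```python
import math


def lambda_grid(X, y, alpha=1.0, n_lambda=100, eps=1e-3):
    """Return a decreasing, log-spaced penalty grid starting at lambda_max.

    Assumes standardized columns in X, centered y, and alpha > 0.
    """
    n, p = len(y), len(X[0])
    # Largest absolute inner product between any feature and the response.
    inner = max(abs(sum(X[i][j] * y[i] for i in range(n))) for j in range(p))
    lam_max = inner / (n * alpha)
    if lam_max <= 0.0:
        raise ValueError("y is orthogonal to every column; no path to compute")
    log_max = math.log(lam_max)
    # Equal steps on the log scale from lam_max down to eps * lam_max.
    step = math.log(eps) / (n_lambda - 1)
    return [math.exp(log_max + k * step) for k in range(n_lambda)]
```

This replaces the manual "10 to the -2 through +5" search: the first grid value is where all coefficients are zero, and each later value relaxes the penalty by a constant factor.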

We're going to make some speed/scaling runs on the synthetic data sets (in a range of sizes)
that are used in Spark for testing linear regression. We need some wider data sets. Joseph
mentioned some that we'll look at. I've got a gene expression data set that's 30k wide by
15k tall. That takes a few hours to train using the R version of glmnet. We're also talking to
some biology friends to find other interesting data sets. 

I really am eager to see the comparisons. And happy to help you tailor OWLQN to generate coefficient
paths. We might be able to produce a hybrid of Friedman's algorithm using his basic algorithm
outline but substituting OWLQN for his round-robin coordinate descent. But I'm a little concerned
that it's the round-robin coordinate descent that makes it possible to skip passing through
the full data set for 4 out of 5 iterations. We might be able to work a way around that. 

I'm just eager to have parallel versions of the tools available. I'll keep you posted on our
results. We should aim for running one another's code. I'll check with my colleagues and see
when we'll have something we can hand out. We've delayed putting together a release version
in favor of generating some scaling results, as Joseph suggested. Discussions like this may
have some impact on what the release code looks like. 
Mike






-----Original Message-----
From: Debasish Das [mailto:debasish.das83@gmail.com]
Sent: Wednesday, February 25, 2015 08:50 AM
To: 'Joseph Bradley'
Cc: mike@mbowles.com, 'dev'
Subject: Re: Have Friedman's glmnet algo running in Spark

Any reason why the regularization path cannot be implemented using the current OWLQN PR?
We can change OWLQN in Breeze to fit your needs...

On Feb 24, 2015 3:27 PM, "Joseph Bradley" <joseph@databricks.com> wrote:
Hi Mike,

I'm not aware of a "standard" big dataset, but there are a number available:
* The YearPredictionMSD dataset from the LIBSVM datasets is sizeable (in #
instances but not # features):
www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html
* I've used this text dataset from which one can generate lots of n-gram
features (but not many instances): http://www.ark.cs.cmu.edu/10K/
* I've seen some papers use the KDD Cup datasets, which might be the best
option I know of. The KDD Cup 2012 track 2 one seems promising.

Good luck!
Joseph

On Tue, Feb 24, 2015 at 1:56 PM, <mike@mbowles.com> wrote:

> Joseph,
> Thanks for your reply. We'll take the steps you suggest - generate some
> timing comparisons and post them in the GLMNET JIRA with a link from the
> OWLQN JIRA.
>
> We've got the regression version of GLMNET programmed. The regression
> version only requires a pass through the data each time the active set of
> coefficients changes. That's usually less than or equal to the number of
> decrements in the penalty coefficient (typical default = 100). The
> intermediate iterations can be done using results of previous passes
> through the full data set. We're expecting the number of data passes will
> be independent of either number of rows or columns in the data set. We're
> eager to demonstrate this scaling. Do you have any suggestions regarding
> data sets for large scale regression problems? It would be nice to
> demonstrate scaling for both number of rows and number of columns.
>
> Thanks for your help.
> Mike
>
> -----Original Message-----
> *From:* Joseph Bradley [mailto:joseph@databricks.com]
> *Sent:* Sunday, February 22, 2015 06:48 PM
> *To:* mike@mbowles.com
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Have Friedman's glmnet algo running in Spark
>
> Hi Mike,
>
> glmnet has definitely been very successful, and it would be great to see
> how we can improve optimization in MLlib! There is some related work
> ongoing; here are the JIRAs:
> * GLMNET implementation in Spark
> * LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
>
> The GLMNET JIRA has actually been closed in favor of the latter JIRA.
> However, if you're getting good results in your experiments, could you
> please post them on the GLMNET JIRA and link them from the other JIRA? If
> it's faster and more scalable, that would be great to find out. As far as
> where the code should go and the APIs, that can be discussed on the JIRA.
>
> I hope this helps, and I'll keep an eye out for updates on the JIRAs!
> Joseph
>
> On Thu, Feb 19, 2015 at 10:59 AM, wrote:
>
> > Dev List,
> > A couple of colleagues and I have gotten several versions of the glmnet
> > algo coded and running on Spark RDD. The glmnet algo
> > (http://www.jstatsoft.org/v33/i01/paper) is a very fast algorithm for
> > generating coefficient paths solving penalized regression with elastic
> > net penalties. The algorithm runs fast by taking an approach that
> > generates solutions for a wide variety of penalty parameter values. We're
> > able to integrate into the MLlib class structure a couple of different
> > ways. The algorithm may fit better into the new pipeline structure since
> > it naturally returns a multitude of models (corresponding to different
> > values of penalty parameters). That appears to fit better into pipeline
> > than MLlib linear regression (for example).
> >
> > We've got regression running with the speed optimizations that Friedman
> > recommends. We'll start working on the logistic regression version next.
> >
> > We're eager to make the code available as open source and would like to
> > get some feedback about how best to do that. Any thoughts?
> > Mike Bowles.
>
>


