spark-dev mailing list archives

From Joseph Bradley <jos...@databricks.com>
Subject Re: Have Friedman's glmnet algo running in Spark
Date Wed, 25 Feb 2015 19:14:52 GMT
Some of this discussion seems valuable enough to preserve on the JIRA; can
we move it there (and copy any relevant discussions from previous emails as
needed)?

On Wed, Feb 25, 2015 at 10:35 AM, <mike@mbowles.com> wrote:

> Hi Debasish,
> Any method that generates point solutions to the minimization problem
> could simply be run a number of times to generate the coefficient paths as
> a function of the penalty parameter.  I think the only issues are how easy
> the method is to use and how much training and developer time is required
> to produce an answer.
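
A minimal sketch of that idea, assuming a lasso objective of (1/(2n))||y - Xb||^2 + lam*||b||_1. The point solver here is a plain proximal-gradient (ISTA) loop used only as a stand-in; it is not OWLQN, and the names ista_lasso and coefficient_path are hypothetical. Any solver with the same signature could be dropped in.

```python
def soft_threshold(z, g):
    """Shrink z toward zero by g (the proximal operator of g * |z|)."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def ista_lasso(X, y, lam, beta0, n_iter=200):
    """Hypothetical point solver: proximal gradient (ISTA) for
    (1/(2n)) * ||y - X b||^2 + lam * ||b||_1, started from beta0.
    Assumes X is a non-empty list of rows with at least one non-zero entry."""
    n, p = len(y), len(beta0)
    # Step size 1/L, with L bounded by trace(X^T X) / n (a safe over-estimate).
    L = sum(v * v for row in X for v in row) / n
    beta = list(beta0)
    for _ in range(n_iter):
        # Each iteration makes a full pass over the rows.
        resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        grad = [-sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
        beta = [soft_threshold(beta[j] - grad[j] / L, lam / L) for j in range(p)]
    return beta


def coefficient_path(solve_point, X, y, lambdas):
    """Run a point solver once per penalty value, warm-starting each solve
    from the solution at the previous (larger) penalty."""
    beta = [0.0] * len(X[0])
    path = []
    for lam in sorted(lambdas, reverse=True):  # largest penalty first
        beta = solve_point(X, y, lam, beta)
        path.append((lam, list(beta)))
    return path
```

Calling coefficient_path(ista_lasso, X, y, lambdas) returns one (penalty, coefficients) pair per penalty value. Note that this stand-in solver touches every row on every iteration, which is the cost the data-pass question below is about.
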
>
> With regard to training time, Friedman says in his paper that they found
> problems where glmnet would generate the entire coefficient path more
> rapidly than sophisticated single point methods would generate single point
> solutions - not all problems, but some problems.  Ryan Tibshirani (Robert's
> son) who's a professor and researcher at CMU in convex function
> optimization has echoed that assertion for the particular case of the
> elasticnet penalty function (that's from slides of his that are available
> online).  So there's an open question about the training speed that I
> believe we can answer in fairly short order.  I'm eager to explore that.
> Does OWLQN do a pass through the data for each iteration?  The linear
> version of GLMNET does not.  On the other hand, OWLQN may be able to take
> coarser steps through parameter space.
>
> With regard to developer time, glmnet doesn't require the user to supply a
> starting point for the penalty parameter.  It calculates the starting
> point.  That makes it completely automatic.  You've probably been through
> the process of manually searching regularization parameter space with SVM.
> Pick out a set of regularization parameter values like 10 raised to the (-2
> through +5 in steps of 1).  See if there's a minimum in the range and if
> not shift to the right or left.  One of the reasons I pick up glmnet first
> for a new problem is that you just drop in the training set and out pop the
> coefficient curves.  Usually the defaults work.  One time out of 50 (or so)
> it doesn't converge.  It alerts you that it didn't converge and you change
> one parameter and rerun.  If you also drop in a test set then it even picks
> the optimum solution and produces an estimate of out-of-sample error.
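
For reference, the automatic starting point can be sketched as follows. This follows the rule described in Friedman's paper: the largest penalty is the smallest value at which all coefficients are zero, and the path is a log-spaced grid below it. The function names are hypothetical, and the defaults of 100 values and a 0.001 ratio are assumptions taken from the paper's description rather than from any Spark code. It assumes the columns of X are standardized, y is centered, and alpha > 0.

```python
def lambda_max(X, y, alpha=1.0):
    """Smallest penalty at which every coefficient is zero.

    Assumes standardized columns, centered y, and alpha > 0
    (alpha is the elastic net mixing parameter; alpha = 1 is the lasso).
    """
    n = len(y)
    p = len(X[0])
    best = 0.0
    for j in range(p):
        dot = sum(X[i][j] * y[i] for i in range(n))  # <x_j, y>
        best = max(best, abs(dot))
    return best / (n * alpha)


def lambda_grid(lam_max, n_lambdas=100, eps=1e-3):
    """Log-spaced penalties from lam_max down to eps * lam_max."""
    ratio = eps ** (1.0 / (n_lambdas - 1))
    return [lam_max * ratio ** k for k in range(n_lambdas)]
```

Compared with a hand-picked grid such as 10 raised to -2 through +5, this grid is anchored to the data, so there is no need to shift the range left or right and rerun.
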
>
> We're going to make some speed/scaling runs on the synthetic data sets (in
> a range of sizes) that are used in Spark for testing linear regression.  We
> need some wider data sets.  Joseph mentioned some that we'll look at.  I've
> got a gene expression data set that's 30k wide by 15k tall.  That takes a
> few hours to train using R version of glmnet.  We're also talking to some
> biology friends to find other interesting data sets.
>
> I really am eager to see the comparisons.  And happy to help you tailor
> OWLQN to generate coefficient paths.  We might be able to produce a hybrid
> of Friedman's algorithm using his basic algorithm outline but substituting
> OWLQN for his round-robin coordinate descent.  But I'm a little concerned
> that it's the round-robin coordinate descent that makes it possible to skip
> passing through the full data set for 4 out of 5 iterations.  We might be
> able to work a way around that.
>
> I'm just eager to have parallel versions of the tools available.  I'll
> keep you posted on our results.  We should aim for running one another's
> code.  I'll check with my colleagues and see when we'll have something we
> can hand out.  We've delayed putting together a release version in favor of
> generating some scaling results, as Joseph suggested.  Discussions like
> this may have some impact on what the release code looks like.
> Mike
>
>
>
>
>
> -----Original Message-----
> *From:* Debasish Das [mailto:debasish.das83@gmail.com]
> *Sent:* Wednesday, February 25, 2015 08:50 AM
> *To:* 'Joseph Bradley'
> *Cc:* mike@mbowles.com, 'dev'
> *Subject:* Re: Have Friedman's glmnet algo running in Spark
>
> Is there any reason why the regularization path cannot be implemented
> using the current OWLQN PR?
>
> We can change OWLQN in Breeze to fit your needs...
>  On Feb 24, 2015 3:27 PM, "Joseph Bradley" <joseph@databricks.com> wrote:
>
>> Hi Mike,
>>
>> I'm not aware of a "standard" big dataset, but there are a number
>> available:
>> * The YearPredictionMSD dataset from the LIBSVM datasets is sizeable (in #
>> instances but not # features):
>> www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html
>> * I've used this text dataset from which one can generate lots of n-gram
>> features (but not many instances): http://www.ark.cs.cmu.edu/10K/
>> * I've seen some papers use the KDD Cup datasets, which might be the best
>> option I know of.  The KDD Cup 2012 track 2 one seems promising.
>>
>> Good luck!
>> Joseph
>>
>> On Tue, Feb 24, 2015 at 1:56 PM, <mike@mbowles.com> wrote:
>>
>> > Joseph,
>> > Thanks for your reply.  We'll take the steps you suggest - generate some
>> > timing comparisons and post them in the GLMNET JIRA with a link from the
>> > OWLQN JIRA.
>> >
>> > We've got the regression version of GLMNET programmed.  The regression
>> > version only requires a pass through the data each time the active set
>> > of coefficients changes.  The number of passes is usually less than or
>> > equal to the number of decrements in the penalty coefficient (typical
>> > default = 100).  The intermediate iterations can be done using results
>> > of previous passes through the full data set.  We're expecting that the
>> > number of data passes will be independent of the number of rows or
>> > columns in the data set.  We're eager to demonstrate this scaling.  Do
>> > you have any suggestions regarding data sets for large scale regression
>> > problems?  It would be nice to demonstrate scaling for both number of
>> > rows and number of columns.
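
The data-pass behavior described above can be illustrated with the "covariance updates" variant of coordinate descent from Friedman's paper. What follows is a rough single-machine sketch, not the code discussed in this thread: the class name CovarianceCD and the data_passes counter are hypothetical, and the sketch assumes no column of X is all zeros. Inner products with a column are computed from the data only the first time that column's coefficient becomes non-zero; every other update uses cached values.

```python
def soft_threshold(z, g):
    """Shrink z toward zero by g."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


class CovarianceCD:
    """Round-robin coordinate descent for the elastic net, driven by cached
    inner products instead of the raw rows (hypothetical class name)."""

    def __init__(self, X, y):
        self.X, self.n, self.p = X, len(y), len(X[0])
        # Initial pass: <x_j, y> and <x_j, x_j> for every column j.
        self.xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(self.p)]
        self.xtx_diag = [sum(row[j] ** 2 for row in X) for j in range(self.p)]
        self.gram_cols = {}  # k -> [<x_j, x_k> for all j], filled lazily
        self.data_passes = 1

    def _gram_col(self, k):
        # Touch the data only the first time coefficient k becomes active.
        # A distributed version would batch all newly active columns in one pass.
        if k not in self.gram_cols:
            self.gram_cols[k] = [sum(row[j] * row[k] for row in self.X)
                                 for j in range(self.p)]
            self.data_passes += 1
        return self.gram_cols[k]

    def fit(self, lam, alpha, beta, tol=1e-10, max_sweeps=1000):
        """Minimize (1/(2n))||y - Xb||^2 + lam*(alpha*||b||_1 + (1-alpha)/2*||b||^2),
        updating beta in place from its current (warm-start) value."""
        n = self.n
        for _ in range(max_sweeps):
            max_change = 0.0
            for j in range(self.p):
                # (1/n) <x_j, partial residual>, from cached products only.
                z = self.xty[j]
                for k in range(self.p):
                    if k != j and beta[k] != 0.0:
                        z -= self._gram_col(k)[j] * beta[k]
                z /= n
                denom = self.xtx_diag[j] / n + lam * (1.0 - alpha)
                new = soft_threshold(z, lam * alpha) / denom
                max_change = max(max_change, abs(new - beta[j]))
                beta[j] = new
            if max_change < tol:
                break
        return beta
```

Calling fit repeatedly with decreasing lam and the same beta list gives a warm-started path; data_passes grows only when a new coefficient enters the active set, which is the scaling property the paragraph above expects.
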
>> >
>> > Thanks for your help.
>> > Mike
>> >
>> > -----Original Message-----
>> > *From:* Joseph Bradley [mailto:joseph@databricks.com]
>> > *Sent:* Sunday, February 22, 2015 06:48 PM
>> > *To:* mike@mbowles.com
>> > *Cc:* dev@spark.apache.org
>> > *Subject:* Re: Have Friedman's glmnet algo running in Spark
>> >
>> > Hi Mike,
>> >
>> > glmnet has definitely been very successful, and it would be great to
>> > see how we can improve optimization in MLlib! There is some related
>> > work ongoing; here are the JIRAs:
>> > * GLMNET implementation in Spark
>> > * LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
>> >
>> > The GLMNET JIRA has actually been closed in favor of the latter JIRA.
>> > However, if you're getting good results in your experiments, could you
>> > please post them on the GLMNET JIRA and link them from the other JIRA?
>> > If it's faster and more scalable, that would be great to find out. As
>> > far as where the code should go and the APIs, that can be discussed on
>> > the JIRA.
>> >
>> > I hope this helps, and I'll keep an eye out for updates on the JIRAs!
>> >
>> > Joseph
>> >
>> > On Thu, Feb 19, 2015 at 10:59 AM,  wrote:
>> >
>> > > Dev List,
>> > > A couple of colleagues and I have gotten several versions of glmnet algo
>> > > coded and running on Spark RDD. glmnet algo (
>> > > http://www.jstatsoft.org/v33/i01/paper) is a very fast algorithm for
>> > > generating coefficient paths solving penalized regression with elastic
>> > > net penalties. The algorithm runs fast by taking an approach that
>> > > generates solutions for a wide variety of penalty parameters. We're able
>> > > to integrate into MLlib class structure a couple of different ways. The
>> > > algorithm may fit better into the new pipeline structure since it
>> > > naturally returns a multitude of models (corresponding to different
>> > > values of penalty parameters). That appears to fit better into pipeline
>> > > than MLlib linear regression (for example).
>> > >
>> > > We've got regression running with the speed optimizations that Friedman
>> > > recommends. We'll start working on the logistic regression version next.
>> > >
>> > > We're eager to make the code available as open source and would like to
>> > > get some feedback about how best to do that. Any thoughts?
>> > > Mike Bowles.
>> >
>> >
>>
>
