spark-dev mailing list archives

From Debasish Das <debasish.da...@gmail.com>
Subject Re: Have Friedman's glmnet algo running in Spark
Date Wed, 25 Feb 2015 16:50:55 GMT
Is there any reason why the regularization path cannot be implemented using
the current OWLQN PR?

We can change OWLQN in Breeze to fit your needs...
 On Feb 24, 2015 3:27 PM, "Joseph Bradley" <joseph@databricks.com> wrote:

> Hi Mike,
>
> I'm not aware of a "standard" big dataset, but there are a number
> available:
> * The YearPredictionMSD dataset from the LIBSVM datasets is sizeable (in #
> instances but not # features):
> www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html
> * I've used this text dataset from which one can generate lots of n-gram
> features (but not many instances): http://www.ark.cs.cmu.edu/10K/
> * I've seen some papers use the KDD Cup datasets, which might be the best
> option I know of.  The KDD Cup 2012 track 2 one seems promising.
>
> Good luck!
> Joseph
>
> On Tue, Feb 24, 2015 at 1:56 PM, <mike@mbowles.com> wrote:
>
> > Joseph,
> > Thanks for your reply.  We'll take the steps you suggest - generate some
> > timing comparisons and post them in the GLMNET JIRA with a link from the
> > OWLQN JIRA.
> >
> > We've got the regression version of GLMNET programmed.  The regression
> > version only requires a pass through the data each time the active set of
> > coefficients changes.  That's usually less than or equal to the number of
> > decrements in the penalty coefficient (typical default = 100).  The
> > intermediate iterations can be done using results of previous passes
> > through the full data set.  We're expecting the number of data passes to
> > be independent of both the number of rows and the number of columns in
> > the data set.  We're eager to demonstrate this scaling.  Do you have any
> > suggestions regarding data sets for large scale regression problems?  It
> > would be nice to demonstrate scaling for both number of rows and number
> > of columns.
> >
> > Thanks for your help.
> > Mike
> >
> > -----Original Message-----
> > *From:* Joseph Bradley [mailto:joseph@databricks.com]
> > *Sent:* Sunday, February 22, 2015 06:48 PM
> > *To:* mike@mbowles.com
> > *Cc:* dev@spark.apache.org
> > *Subject:* Re: Have Friedman's glmnet algo running in Spark
> >
> > Hi Mike,
> >
> > glmnet has definitely been very successful, and it would be great to see
> > how we can improve optimization in MLlib! There is some related work
> > ongoing; here are the JIRAs:
> > * GLMNET implementation in Spark
> > * LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
> >
> > The GLMNET JIRA has actually been closed in favor of the latter JIRA.
> > However, if you're getting good results in your experiments, could you
> > please post them on the GLMNET JIRA and link them from the other JIRA? If
> > it's faster and more scalable, that would be great to find out. As far as
> > where the code should go and the APIs, that can be discussed on the JIRA.
> > I hope this helps, and I'll keep an eye out for updates on the JIRAs!
> >
> > Joseph
> >
> > On Thu, Feb 19, 2015 at 10:59 AM,  wrote:
> >
> > > Dev List,
> > > A couple of colleagues and I have gotten several versions of the glmnet
> > > algo coded and running on Spark RDD. The glmnet algo
> > > (http://www.jstatsoft.org/v33/i01/paper) is a very fast algorithm for
> > > generating coefficient paths solving penalized regression with elastic
> > > net penalties. The algorithm runs fast by taking an approach that
> > > generates solutions for a wide variety of penalty parameters. We're able
> > > to integrate into the MLlib class structure a couple of different ways.
> > > The algorithm may fit better into the new pipeline structure since it
> > > naturally returns a multitude of models (corresponding to different
> > > values of penalty parameters). That appears to fit better into pipeline
> > > than MLlib linear regression (for example).
> > >
> > > We've got regression running with the speed optimizations that Friedman
> > > recommends. We'll start working on the logistic regression version next.
> > >
> > > We're eager to make the code available as open source and would like to
> > > get some feedback about how best to do that. Any thoughts?
> > > Mike Bowles.
> >
> >
>
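
The idea Mike describes in the thread (a data pass only when the active set
changes, along a path of about 100 decreasing penalty values) can be sketched
in plain Python. This is an illustrative single-machine sketch, not the code
discussed in the thread. The function names, the objective scaling
(1/(2n))*||y - X*b||^2 + lam*(alpha*||b||_1 + (1 - alpha)/2*||b||^2), the
log-spaced penalty grid and its `eps` ratio are assumptions that follow the
conventions of the glmnet paper. It also assumes centered data, no all-zero
columns, `alpha > 0` and `n_lambdas >= 2`.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net_path(X, y, alpha=1.0, n_lambdas=100, eps=1e-3,
                     tol=1e-9, max_sweeps=1000):
    """Coordinate descent with covariance updates along a penalty path.

    X is a list of rows, y a list of targets. Returns (path, passes), where
    path is a list of (lam, coefficients) and passes counts scans of X.
    """
    n, p = len(X), len(X[0])
    # First data pass: <x_j, y>/n and <x_j, x_j>/n for every feature j.
    xy = [sum(X[i][j] * y[i] for i in range(n)) / n for j in range(p)]
    diag = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    passes = 1
    gram_cols = {}  # j -> [<x_j, x_k>/n for all k], filled lazily

    # Smallest penalty for which all coefficients are zero, then a log grid.
    lam_max = max(abs(v) for v in xy) / alpha
    lambdas = [lam_max * eps ** (t / (n_lambdas - 1)) for t in range(n_lambdas)]

    beta = [0.0] * p
    grad = xy[:]  # grad[k] = <x_k, y - X*beta>/n; equals xy while beta == 0
    path = []
    for lam in lambdas:  # warm start: beta carries over between penalties
        for _ in range(max_sweeps):
            max_change = 0.0
            for j in range(p):
                if diag[j] == 0.0:
                    continue  # all-zero column: coefficient stays at zero
                z = grad[j] + diag[j] * beta[j]
                new = soft_threshold(z, lam * alpha) / (diag[j] + lam * (1.0 - alpha))
                delta = new - beta[j]
                if delta != 0.0:
                    if j not in gram_cols:
                        # Feature j enters the active set: one more data pass.
                        gram_cols[j] = [
                            sum(X[i][j] * X[i][k] for i in range(n)) / n
                            for k in range(p)
                        ]
                        passes += 1
                    col = gram_cols[j]
                    for k in range(p):
                        grad[k] -= col[k] * delta  # no data access needed
                    beta[j] = new
                    max_change = max(max_change, abs(delta))
            if max_change < tol:
                break
        path.append((lam, beta[:]))
    return path, passes
```

In this sketch the number of scans of X is at most one plus the number of
features that ever become active, regardless of how many sweeps each penalty
value needs. On an RDD, each such scan would presumably be a single
aggregation over the rows.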
