spark-dev mailing list archives

From Ameet Talwalkar <am...@eecs.berkeley.edu>
Subject Re: Machine Learning on Spark [long rambling discussion email]
Date Thu, 25 Jul 2013 18:41:47 GMT
Hi Nick,

I can understand your 'frustration' -- my hope is that having discussions
(like the one we're having now) via this mailing list will help mitigate
duplicate work moving forward.

Regarding your detailed comments, we are aiming to include various
components that you mentioned in our release (basic evaluation for
collaborative filtering, linear model additions, and basic support for
sparse vectors/features).  One particularly interesting avenue that is not
on our immediate roadmap is adding implicit feedback for matrix
factorization.  Algorithms like SVD++ are often used in practice, and it
would be great to add them to the MLI library (and perhaps also MLlib).
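For anyone on the list who hasn't seen the implicit-feedback formulation, the core transform is tiny: each observed count r becomes a binary preference plus a confidence weight. A quick Python sketch (illustrative only; alpha is a tuning constant, and 40 is the value suggested in the Hu/Koren/Volinsky paper):

```python
# Sketch of the preference/confidence transform from Hu, Koren &
# Volinsky (2008) for implicit feedback: an observed count r becomes a
# binary preference p and a confidence c = 1 + alpha * r.

def to_confidence(count, alpha=40.0):
    """Map a raw implicit-feedback count to (preference, confidence)."""
    preference = 1.0 if count > 0 else 0.0
    confidence = 1.0 + alpha * count
    return preference, confidence
```

The ALS normal equations then minimise the confidence-weighted squared error over the preference matrix rather than the raw counts.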
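And on the sparse vector/feature support, the usual hashing trick is similarly easy to sketch (again illustrative, not a committed API; the hash function and bucket count are placeholders):

```python
# Sketch of the hashing trick for sparse features: hash arbitrary
# feature names into a fixed-size index space so the weight vector
# stays bounded even with an open-ended feature set. Colliding
# features simply add their values together.
from collections import defaultdict
import hashlib

def hash_index(feature, num_buckets=2 ** 20):
    """Deterministic non-negative bucket index for a feature name."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

def hash_features(features, num_buckets=2 ** 20):
    """Turn {name: value} into a sparse {index: value} representation."""
    hashed = defaultdict(float)
    for name, value in features.items():
        hashed[hash_index(name, num_buckets)] += value
    return dict(hashed)
```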

-Ameet


On Thu, Jul 25, 2013 at 6:44 AM, Nick Pentreath <nick.pentreath@gmail.com> wrote:

> Hi
>
> Ok, that all makes sense. I can see the benefit of good standard libraries
> definitely, and I guess the pieces that felt "missing" to me were what you
> are describing as MLI and MLOptimizer.
>
> It seems the aims of MLI are very much in line with what I have/had in
> mind for an ML library/framework; the goals overlap quite a lot.
>
> I guess one "frustration" I have had is that there are all these great BDAS
> projects, but we never really know when they will be released and what they
> will look like until they are. In this particular case I couldn't wait for
> MLlib so ended up doing some work myself to port Mahout's ALS and of course
> have ended up duplicating effort (which is not a problem as it was
> necessary at the time and has been a great learning experience).
>
> Similarly for GraphX, I would like to develop a project for a Spark-based
> version of Faunus (https://github.com/thinkaurelius/faunus) for batch
> processing of data in our Titan graph DB. For now I am working with
> Bagel-based primitives and Spark RDDs directly; I would love to use
> GraphX, but have no idea when it will be released and can have little
> involvement until it is.
>
> (I use "frustration" in the nicest way here - I love the BDAS concepts and
> all the projects coming out, I just want them all to be released NOW!! :)
>
> So yes I would love to be involved in MLlib and MLI work to the extent I
> can assist and the work is aligned with what I need currently in my
> projects (this is just from a time allocation viewpoint - I'm sure much of
> it will be complementary).
>
> Anyway, it seems to me the best course of action is as follows:
>
>    - I'll get involved in MLlib and see how I can contribute there. Some
>    things that jump out:
>
>    - implicit preference capability for the ALS model, since as far as I
>       can see it currently handles explicit prefs only? (Implicit prefs
>       here: http://68.180.206.246/files/HuKorenVolinsky-ICDM08.pdf, which
>       is typically better if we don't have actual rating data but instead
>       "view", "click", "play" or whatever)

>       - RMSE and other evaluation metrics for ALS as well as test/train
>       split / cross-val stuff?

>       - linear model additions, like new loss functions for hinge loss,
>       least squares etc. for SGD, as well as learning rate stuff (
>       http://arxiv.org/pdf/1305.6646) and regularisers (L1/L2/Elastic
>       Net), i.e. bring the SGD stuff in line with Vowpal Wabbit / sklearn
>       (if that's desirable; my view is yes)

>       - what about sparse weight and feature vectors for linear models/SGD?
>       Together with hashing this allows very large models while still
>       being efficient, and it is particularly useful with L1 reg.

>       - finally, what about online models? i.e. SGD models are currently
>       "static": once trained they can only predict, whereas SGD can of
>       course keep learning. Or does one simply re-train with the previous
>       weight vector as the initial value (I guess that can work just as
>       well)... Also on this topic: training / predicting on Streams as
>       well as RDDs
>    - I can put up what I have done to a BitBucket account and grant access
>    to whichever devs would like to take a look. The only reason I don't
>    just throw it up on GitHub is that frankly it is not really ready and
>    is not a fully-fledged project yet (I think anyway). Possibly some of
>    this can be useful: there's not all that much there apart from the ALS
>    (but it does solve for both explicit and implicit preference data as
>    per Mahout's implementation), KMeans (simpler than the one in MLlib as
>    I didn't yet get around to doing KMeans++ init) and the arg-parsing /
>    jobrunner (which may or may not be interesting both for ML and for
>    Spark jobs in general).
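
[A quick inline note on the SGD items above: a single hinge-loss update step with L2 regularisation is small enough to sketch in plain Python. This is illustrative only, not MLlib's actual interface; the function name, fixed learning rate and regulariser parameter are placeholders.]

```python
# Sketch of one SGD step for the hinge loss (linear SVM) with L2
# regularisation: the kind of pluggable loss/regulariser combination
# described above. Labels y are +1/-1; eta is an illustrative
# fixed learning rate.

def hinge_sgd_step(w, x, y, eta=0.1, lam=0.0):
    """Return updated weights after one hinge-loss SGD step."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    new_w = []
    for wi, xi in zip(w, x):
        # Subgradient of the hinge loss: -y * x_i when the margin is
        # violated (y * <w, x> < 1), zero otherwise; plus the L2 term.
        grad_loss = -y * xi if margin < 1.0 else 0.0
        new_w.append(wi - eta * (grad_loss + lam * wi))
    return new_w
```

[A real implementation would presumably vectorise this across an RDD and use a decaying learning-rate schedule rather than a constant eta.]
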
>
> Let me know your thoughts
> Nick
>
>
> On Wed, Jul 24, 2013 at 10:09 PM, Ameet Talwalkar
> <ameet@eecs.berkeley.edu> wrote:
>
> > Hi Nick,
> >
> > Thanks for your email, and it's great to see such excitement around this
> > work!  Matei and Reynold already addressed the motivation behind MLlib as
> > well as our reasons for not using Breeze, and I'd like to give you some
> > background about MLbase, and discuss how it may fit with your interests.
> >
> > There are three components of MLbase:
> >
> > 1) MLlib: As Matei mentioned, this is an ML library in Spark with core ML
> > kernels and solid implementations of common algorithms that can be used
> > easily by Java/Python and also called into by higher-level systems (e.g.
> > MLI, Shark, PySpark).
> >
> > 2) MLI: this is an ML API that provides a common interface for ML
> > algorithms (the same interface used in MLlib), and introduces high-level
> > abstractions to simplify feature extraction / exploration and ML
> > algorithm development.  These abstractions leverage the kernels in MLlib
> > when possible, and also introduce additional kernels.  This work also
> > includes a library written against the MLI.  The MLI is currently
> > written against Spark, but is designed to be platform independent, so
> > that code written against MLI could be run on different engines (e.g.,
> > Hadoop, GraphX, etc.).
> >
> >
> > 3) ML Optimizer: This piece automates the task of model selection.  The
> > optimizer can be viewed as a search problem over feature extraction /
> > algorithms included in the MLI library, and is in part based on efficient
> > cross validation. This work is under active development but is in an
> > earlier stage of development than MLlib and MLI.
> >
> > (note: MLlib will be included with the Spark codebase, while the MLI
> > and ML Optimizer will live in separate repositories.)
> >
> > As far as I can tell (though please correct me if I've misunderstood)
> > your main goals include:
> >
> > i) "consistency in the API"
> > ii) "some level of abstraction but to keep things as simple as possible"
> > iii) "execute models on Spark ... while providing workflows for
> > pipelining transformations, feature extraction, testing and
> > cross-validation, and data viz."
> >
> > The MLI (and to some extent the ML Optimizer) is very much in line with
> > these goals, and it would be great if you were interested in
> > contributing to it.  MLI is a private repository right now, but we'll
> > make it public soon, and Evan Sparks or I will let you know when we do.
> >
> > Thanks again for getting in touch with us!
> >
> > -Ameet
> >
> >
> > On Wed, Jul 24, 2013 at 11:47 AM, Reynold Xin <rxin@cs.berkeley.edu>
> > wrote:
> >
> > > On Wed, Jul 24, 2013 at 1:46 AM, Nick Pentreath
> > > <nick.pentreath@gmail.com> wrote:
> > >
> > > >
> > > > I also found Breeze to be very nice to work with and like the DSL -
> > hence
> > > > my question about why not use that? (Especially now that Breeze is
> > > actually
> > > > just breeze-math and breeze-viz).
> > > >
> > >
> > >
> > > Matei addressed this from a higher level. I want to provide a little
> > > bit more context. A common property of a lot of high-level Scala DSL
> > > libraries is that simple operators tend to have high virtual function
> > > overheads and also create a lot of temporary objects. And because the
> > > level of abstraction is so high, it is fairly hard to debug / optimize
> > > performance.
> > >
> > >
> > >
> > >
> > > --
> > > Reynold Xin, AMPLab, UC Berkeley
> > > http://rxin.org
> > >
> >
>
