spark-dev mailing list archives

From Matei Zaharia <>
Subject Re: Machine Learning on Spark [long rambling discussion email]
Date Wed, 24 Jul 2013 18:39:54 GMT
Hey Nick,

Thanks for your interest in this stuff! I'm going to let the MLbase team answer this in more
detail, but just some quick answers on the first part of your email:

- From my point of view, the ML library in Spark is meant to be just a library of "kernel"
functions you can call, not a complete ETL and data format system like Mahout. The goal was
to have good implementations of common algorithms that different higher-level systems (e.g.
MLbase, Shark, PySpark) can call into.

- We wanted to try keeping this in Spark initially to make it a kind of "standard library".
This is something that can help ensure it becomes high-quality over time and keep it supported
by the project. If you think about it, projects like R and Matlab are very strong primarily
because they have great standard libraries. This was also one of the things we thought would
differentiate us from Hadoop and Mahout. However, we will of course see how things go and
separate it out if it needs a faster dev cycle.

- I haven't worried much about compatibility with Mahout because I'm not sure Mahout is too
widely used and I'm not sure its abstractions are best. Mahout is very tied to HDFS, SequenceFiles,
etc. We will of course try to interoperate well with data from Mahout, but at least as far
as I was concerned, I wanted an API that makes sense for Spark users.

- Something that's maybe not clear about the MLlib API is that we also want it to be used
easily from Java and Python. So we've explicitly avoided having very high-level types or using
Scala-specific features, in order to get something that will be simple to call from these
languages. This does leave room for wrappers that provide higher-level interfaces.

In any case, if you like this "kernel" design for MLlib, it would be great to get more people
contributing to it, or to get it used in other projects. I'll let the MLbase folks talk about
higher-level interfaces -- this is definitely something they want to do, but they might be
able to use help. In any case though, sharing the low-level kernels across Spark projects
would make a lot of sense.


On Jul 24, 2013, at 1:46 AM, Nick Pentreath <> wrote:

> Hi dev team
> (Apologies for a long email!)
> Firstly great news about the inclusion of MLlib into the Spark project!
> I've been working on a concept and some code for a machine learning library
> on Spark, and so of course there is a lot of overlap between MLlib and what
> I've been doing.
> I wanted to throw this out there and (a) ask a couple of design and roadmap
> questions about MLlib, and (b) talk about how to work together / integrate
> my ideas (if at all :)
> *Some questions*
> 1. What is the general design idea behind MLlib - is it aimed at being a
> collection of algorithms, ie a library? Or is it aimed at being a "Mahout
> for Spark", i.e. something that can be used as a library as well as a set
> of tools for things like running jobs, feature extraction, text processing
> etc?
> 2. How married are we to keeping it within the Spark project? While I
> understand the reasoning behind it I am not convinced it's best. But I
> guess we can wait and see how it develops
> 3. Some of the original test code I saw around the Block ALS did use Breeze
> for some of the linear algebra. Now I see
> everything is using JBLAS directly and Array[Double]. Is there a specific
> reason for this? Is it aimed at creating a separation whereby the linear
> algebra backend could be switched out? Scala 2.10 issues?
> 4. Since Spark is meant to be nicely compatible with Hadoop, do we care
> about compatibility/integration with Mahout? This may also encourage Mahout
> developers to switch over and contribute their expertise (see for example
> Dmitry's work,
> where he is doing a Scala/Spark DSL around mahout-math matrices and
> distributed operations). Potentially even using mahout-math for linear
> algebra routines?
> 5. Is there a roadmap? (I've checked the JIRA which does have a few
> intended models etc). Who are the devs most involved in this project?
> 6. What are thoughts around API design for models?
> *Some thoughts*
> So, over the past couple of months I have been working on a machine
> learning library. Initially it was for my own use but I've added a few
> things and was starting to think about releasing it (though it's not nearly
> ready). The model that I really needed first was ALS for doing
> recommendations. So I have ported the ALS code from Mahout to Spark. Well,
> "ported" in some sense - mostly I copied the algorithm and data
> distribution design, using Spark's primitives and Breeze for all the linear
> algebra.
> I found it pretty straightforward to port over. So far I have done local
> testing only on the Movielens datasets. I have found my RMSE results to
> match Mahout's. Interestingly, the wall-clock performance is closer than I
> would have expected. But I would now like to do some larger-scale tests on a
> cluster to make a proper comparison.
> Obviously with Spark's Block ALS model, my version is now somewhat
> superfluous since I expect (and have so far seen in my simple local
> experiments) that the block model will significantly outperform mine. I will
> probably be porting my use case over to this in due time once I've done
> further testing.
> I also found Breeze to be very nice to work with and like the DSL - hence
> my question about why not use that? (Especially now that Breeze is actually
> just breeze-math and breeze-viz).
> Anyway, I then added KMeans (basically just the Spark example with some
> Breeze tweaks), and started working on a Linear Model framework. I've also
> added a simple framework for arg parsing and config (using Twitter
> Algebird's Args and Typesafe Config), and have started on feature
> extraction stuff - of particular interest will be text feature extraction
> and feature hashing.
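The feature-hashing ("hashing trick") idea mentioned above can be illustrated with a short, self-contained Python sketch. This is a toy illustration only, with invented names, not code from Nick's library:

```python
# Toy sketch of the hashing trick: map arbitrary string features into a
# fixed-size vector without maintaining a feature dictionary.
import hashlib


def hash_features(features, num_buckets=16):
    """Return a dense vector of length num_buckets with hashed feature counts."""
    vec = [0.0] * num_buckets
    for f in features:
        # Use a stable hash so the bucket assignment is reproducible across runs.
        h = int(hashlib.md5(f.encode("utf-8")).hexdigest(), 16)
        vec[h % num_buckets] += 1.0
    return vec


# Three categorical features land in three (possibly colliding) buckets.
v = hash_features(["user=alice", "item=42", "genre=scifi"])
```

The appeal for large-scale text features is that the vector size is fixed up front, so no pass over the data is needed to build a vocabulary.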
> This is roughly the idea for a machine learning library on Spark that I
> have - call it a design or manifesto or whatever:
> - Library available and consistent across Scala, Java and Python (as much
> as possible in any event)
> - A core library and also a set of stuff for easily running models based on
> standard input formats etc
> - Standardised model API (even across languages) to the extent possible.
> I've based mine so far on Python's scikit-learn (.fit(), .predict() etc).
> Why? I believe it's a major strength of scikit-learn, that its API is so
> clean, simple and consistent. Plus, for the Python version of the lib,
> scikit-learn will no doubt be used wherever possible to avoid re-creating
> code
> - Models to be included initially:
>  - ALS
>  - Possibly co-occurrence recommendation stuff similar to Mahout's Taste
>  - Clustering (K-Means and others potentially)
>  - Linear Models - the idea here is to have something very close to Vowpal
> Wabbit, ie a generic SGD engine with various Loss Functions, learning rate
> paradigms etc. Furthermore this would allow other models similar to VW such
> as online versions of matrix factorisation, neural nets and learning
> reductions
>  - Possibly Decision Trees / Random Forests
> - Some utilities for feature extraction (hashing in particular), and to
> make running jobs easy (integration with Spark's ./run etc?)
> - Stuff for making pipelining easy (like scikit-learn) and for doing things
> like cross-validation in a principled (and parallel) way
> - Clean and easy integration with Spark Streaming for online models (e.g. a
> linear SGD can be called with fit() on batch data, and then fit() and/or
> fit/predict() on streaming data to learn further online etc).
> - Interactivity provided by shells (IPython, Spark shell) and also plotting
> capability (Matplotlib, and Breeze Viz)
> - For Scala, integration with Shark via sql2rdd etc.
> - I'd like to create something similar to Scalding's Matrix API based on
> RDDs for representing distributed matrices, as well as integrate the ideas
> of Dmitry and Mahout's DistributedRowMatrix etc
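The generic-SGD-engine point in the list above (a single update loop with pluggable loss functions, in the spirit of Vowpal Wabbit) can be sketched with a small, self-contained Python toy. All names are invented for illustration, not the proposed implementation:

```python
# Toy generic SGD engine: the update loop is fixed, the loss is pluggable.


def squared_loss_grad(pred, label):
    # Gradient of 0.5 * (pred - label)^2 with respect to pred.
    return pred - label


def sgd_fit(data, num_features, loss_grad, lr=0.1, epochs=50):
    """data: list of (feature_vector, label) pairs; returns learned weights."""
    w = [0.0] * num_features
    for _ in range(epochs):
        for x, y in data:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            g = loss_grad(pred, y)
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w


# Learn y = 2 * x from a few points with the squared loss.
data = [([1.0], 2.0), ([2.0], 4.0), ([3.0], 6.0)]
w = sgd_fit(data, 1, squared_loss_grad, lr=0.05, epochs=200)
```

Swapping in a logistic or hinge gradient gives logistic regression or a linear SVM from the same loop, which is what makes the VW-style design attractive.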
> Here is a rough outline of the model API I have used at the moment:
> This works nicely for ALS,
> clustering, linear models etc.
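A scikit-learn-style fit()/predict() interface of the kind described here can be illustrated with a deliberately trivial Python sketch. The names are hypothetical, not the actual API from the linked outline:

```python
# Hypothetical sketch of a scikit-learn-style model API: fit() trains and
# returns the model so calls can be chained; predict() scores new points.


class MeanModel:
    """Trivial 'model' that always predicts the training-set mean."""

    def __init__(self):
        self.mean_ = None

    def fit(self, labels):
        self.mean_ = sum(labels) / len(labels)
        return self  # chainable, as in scikit-learn

    def predict(self, points):
        return [self.mean_ for _ in points]


preds = MeanModel().fit([1.0, 2.0, 3.0]).predict([10, 20])
```

The same two-method shape covers ALS (fit on ratings, predict user/item pairs), clustering (fit centroids, predict assignments) and linear models, which is what makes a uniform API feasible.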
> So as you can see, this mostly overlaps with what MLlib already has or has
> planned. My main aim is frankly consistency in the API and some level of
> abstraction, while keeping things as simple as possible (ie let Spark handle
> the complex stuff) - and thus hopefully avoiding a somewhat haphazard
> collection of models that is hard to figure out how to use, which is
> unfortunately what I believe has happened to Mahout.
> So the question then is, how to work together or integrate? I see 3 options:
> 1. I go my own way (not very appealing obviously)
> 2. Contribute what I have (or as much as makes sense) to MLlib
> 3. Create my project as a "front-end" or "wrapper" around MLlib as the
> core, effectively providing the API and workflow interface but using MLlib
> as the model engine.
> #2 is appealing but then a lot depends on the API and framework design and
> how much what I have in mind is compatible with the rest of the devs etc
> #3 now that I have written it, starts to sound pretty interesting -
> potentially we're looking at a "front-end" that could in fact execute
> models on Spark (or other engines like Hadoop/Mahout, GraphX etc), while
> providing workflows for pipelining transformations, feature extraction,
> testing and cross-validation, and data viz.
> But of course #3 starts sounding somewhat like what MLBase is aiming to be
> (I think)!
> At this point I'm willing to show what I have done so far on a selective
> basis, if it makes sense - be warned, though, that it is rough, unfinished
> and perhaps somewhat clunky, as it's my first attempt at a library/framework.
> Especially so because the main thing I did was the ALS port, and with MLlib's
> own version of ALS that may be less useful now in any case.
> It may be that none of this is that useful to others anyway, which is fine -
> I'll keep developing the tools that I need, and potentially they will be
> useful at some point.
> Thoughts, feedback, comments, discussion? I really want to jump into MLlib
> and get involved in contributing to standardised machine learning on Spark!
> Nick
