spark-dev mailing list archives

From Nick Pentreath <>
Subject Machine Learning on Spark [long rambling discussion email]
Date Wed, 24 Jul 2013 08:46:17 GMT
Hi dev team

(Apologies for a long email!)

Firstly, great news about the inclusion of MLlib in the Spark project!

I've been working on a concept and some code for a machine learning library
on Spark, and so of course there is a lot of overlap between MLlib and what
I've been doing.

I wanted to throw this out there and (a) ask a couple of design and roadmap
questions about MLlib, and (b) talk about how to work together / integrate
my ideas (if at all :)

*Some questions*
1. What is the general design idea behind MLlib - is it aimed at being a
collection of algorithms, i.e. a library? Or is it aimed at being a "Mahout
for Spark", i.e. something that can be used as a library as well as a set
of tools for things like running jobs, feature extraction and text processing?
2. How married are we to keeping it within the Spark project? While I
understand the reasoning behind it, I'm not convinced it's best. But I
guess we can wait and see how it develops.
3. Some of the original test code I saw around the block ALS did use Breeze
for some of the linear algebra. Now I see everything is using JBLAS
directly and Array[Double]. Is there a specific reason for this? Is it
aimed at creating a separation whereby the linear algebra backend could be
swapped out? Scala 2.10 issues?
4. Since Spark is meant to be nicely compatible with Hadoop, do we care
about compatibility/integration with Mahout? This may also encourage Mahout
developers to switch over and contribute their expertise (see for example
Dmitry's work, where he is doing a Scala/Spark DSL around mahout-math
matrices and distributed operations). Potentially we could even use
mahout-math for linear algebra routines?
5. Is there a roadmap? (I've checked the JIRA, which does list a few
intended models.) Who are the devs most involved in this project?
6. What are thoughts around API design for models?

*Some thoughts*
So, over the past couple of months I have been working on a machine
learning library. Initially it was for my own use but I've added a few
things and was starting to think about releasing it (though it's not nearly
ready). The model that I really needed first was ALS for doing
recommendations. So I have ported the ALS code from Mahout to Spark. Well,
"ported" in some sense - mostly I copied the algorithm and data
distribution design, using Spark's primitives and Breeze for all the linear
algebra.

I found it pretty straightforward to port over. So far I have done local
testing only on the Movielens datasets, and my RMSE results match
Mahout's. Interestingly, the wall-clock performance is not as dissimilar
as I would have expected, but I would now like to do some larger-scale
tests on a cluster to make a proper comparison.
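For clarity, the comparison metric here is just root mean squared error over held-out ratings. A minimal sketch in plain Scala (no Spark dependency; `predicted` and `actual` are assumed to be aligned rating arrays):

```scala
// Root mean squared error over paired predicted/actual ratings.
// Plain Scala for illustration; on Spark this would be a map/reduce over an RDD.
def rmse(predicted: Array[Double], actual: Array[Double]): Double = {
  require(predicted.length == actual.length && predicted.nonEmpty)
  val sumSq = predicted.zip(actual).map { case (p, a) =>
    val err = p - a
    err * err
  }.sum
  math.sqrt(sumSq / predicted.length)
}
```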

Obviously, with Spark's block ALS model, my version is now somewhat
superfluous, since I expect (and have so far seen in my simple local
experiments) that the block model will significantly outperform it. I will
probably port my use case over to it in due course, once I've done further
testing.

I also found Breeze very nice to work with, and I like the DSL - hence my
question about why not use it (especially now that Breeze is actually just
breeze-math and breeze-viz).

Anyway, I then added KMeans (basically just the Spark example with some
Breeze tweaks), and started working on a Linear Model framework. I've also
added a simple framework for arg parsing and config (using Twitter
Algebird's Args and Typesafe Config), and have started on feature
extraction stuff - of particular interest will be text feature extraction
and feature hashing.

This is roughly the idea for a machine learning library on Spark that I
have - call it a design or manifesto or whatever:

- Library available and consistent across Scala, Java and Python (as much
as possible in any event)
- A core library and also a set of stuff for easily running models based on
standard input formats etc
- Standardised model API (even across languages) to the extent possible.
I've based mine so far on Python's scikit-learn (.fit(), .predict() etc).
Why? I believe it's a major strength of scikit-learn that its API is so
clean, simple and consistent. Plus, for the Python version of the lib,
scikit-learn will no doubt be used wherever possible to avoid re-creating
existing functionality.
- Models to be included initially:
  - ALS
  - Possibly co-occurrence recommendation stuff similar to Mahout's Taste
  - Clustering (K-Means and others potentially)
  - Linear Models - the idea here is to have something very close to Vowpal
Wabbit, i.e. a generic SGD engine with various loss functions, learning-rate
paradigms etc. This would also allow other models similar to VW, such as
online versions of matrix factorisation, neural nets and so on
  - Possibly Decision Trees / Random Forests
- Some utilities for feature extraction (hashing in particular), and to
make running jobs easy (integration with Spark's ./run etc?)
- Stuff for making pipelining easy (like scikit-learn) and for doing things
like cross-validation in a principled (and parallel) way
- Clean and easy integration with Spark Streaming for online models (e.g. a
linear SGD can be called with fit() on batch data, and then fit() and/or
fit/predict() on streaming data to learn further online etc).
- Interactivity provided by shells (IPython, Spark shell) and also plotting
capability (Matplotlib, and Breeze Viz)
- For Scala, integration with Shark via sql2rdd etc.
- I'd like to create something similar to Scalding's Matrix API based on
RDDs for representing distributed matrices, as well as integrate the ideas
of Dmitry and Mahout's DistributedRowMatrix etc
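To make the "generic SGD engine with pluggable loss functions" idea above concrete, here is a minimal sketch in plain Scala. All names are hypothetical and there is no Spark wiring; the point is just that the loss is a plug-in, so squared, logistic, hinge etc. become interchangeable within one update loop:

```scala
// Hypothetical sketch of a pluggable-loss SGD engine (names illustrative).
// gradient returns dLoss/dPrediction for a single example.
trait Loss {
  def gradient(prediction: Double, label: Double): Double
}

// Squared loss: L = 0.5 * (p - y)^2, so dL/dp = p - y.
object SquaredLoss extends Loss {
  def gradient(prediction: Double, label: Double): Double = prediction - label
}

// One SGD pass over the data; features are dense Array[Double] for simplicity.
def sgdEpoch(
    weights: Array[Double],
    data: Seq[(Array[Double], Double)],
    loss: Loss,
    learningRate: Double): Array[Double] = {
  val w = weights.clone()
  for ((x, y) <- data) {
    val pred = w.zip(x).map { case (wi, xi) => wi * xi }.sum
    val g = loss.gradient(pred, y)
    for (i <- w.indices) w(i) -= learningRate * g * x(i)
  }
  w
}
```

A VW-style engine would add learning-rate schedules and regularisation as further plug-ins alongside the loss.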

Here is a rough outline of the model API I have used so far; it works
nicely for ALS, clustering, linear models etc.
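A hedged sketch of the scikit-learn-style shape described above (all names illustrative, plain Scala, no Spark types) might look like the following, with a toy model included to show the fit/predict flow:

```scala
// Illustrative sketch of a scikit-learn-style model API (names hypothetical).
// A Model is fitted on training data and can then predict on new points.
trait Model[Datum, Prediction] {
  def fit(data: Seq[Datum]): this.type
  def predict(point: Datum): Prediction
}

// Toy example: a "model" that simply learns the mean of its training values.
class MeanModel extends Model[Double, Double] {
  private var mean = 0.0
  def fit(data: Seq[Double]): this.type = {
    mean = if (data.isEmpty) 0.0 else data.sum / data.length
    this
  }
  def predict(point: Double): Double = mean
}
```

Returning `this.type` from fit() allows chaining, mirroring scikit-learn's convention of `fit` returning the estimator itself.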

So as you can see, this mostly overlaps with what MLlib already has or has
planned in some way. But my main aim is frankly to have consistency in the
API and some level of abstraction, while keeping things as simple as
possible (i.e. letting Spark handle the complex stuff) - and thus hopefully
to avoid ending up with a somewhat haphazard collection of models that is
not simple to figure out how to use, which is unfortunately what I believe
has happened to Mahout.

So the question then is, how to work together or integrate? I see 3 options:

1. I go my own way (not very appealing obviously)
2. Contribute what I have (or as much as makes sense) to MLlib
3. Create my project as a "front-end" or "wrapper" around MLlib as the
core, effectively providing the API and workflow interface but using MLlib
as the model engine.

#2 is appealing, but then a lot depends on the API and framework design,
and on how compatible what I have in mind is with what the rest of the devs
want.
#3, now that I have written it down, starts to sound pretty interesting -
potentially we're looking at a "front-end" that could in fact execute
models on Spark (or other engines like Hadoop/Mahout, GraphX etc), while
providing workflows for pipelining transformations, feature extraction,
testing and cross-validation, and data viz.

But of course #3 starts sounding somewhat like what MLBase is aiming to be
(I think)!

At this point I'm willing to share what I have done so far on a selective
basis, if that makes sense - be warned, though, that it is rough, not
finished, and perhaps somewhat clunky, as it's my first attempt at a
library/framework. Especially since the main thing I did was the ALS port,
and with MLlib's version of ALS that may be less useful now in any event.

It may be that none of this is that useful to others anyway, which is fine -
I'll keep developing the tools I need, and potentially they will be useful
at some point.

Thoughts, feedback, comments, discussion? I really want to jump into MLlib
and get involved in contributing to standardised machine learning on Spark!

