spark-dev mailing list archives

From Ameet Talwalkar <>
Subject Re: Machine Learning on Spark [long rambling discussion email]
Date Wed, 24 Jul 2013 20:09:23 GMT
Hi Nick,

Thanks for your email, and it's great to see such excitement around this
work!  Matei and Reynold already addressed the motivation behind MLlib as
well as our reasons for not using Breeze, and I'd like to give you some
background about MLbase, and discuss how it may fit with your interests.

There are three components of MLbase:

1) MLlib: As Matei mentioned, this is an ML library in Spark with core ML
kernels and solid implementations of common algorithms that can be used
easily by Java/Python and also called into by higher-level systems (e.g.
MLI, Shark, PySpark).

2) MLI: this is an ML API that provides a common interface for ML
algorithms (the same interface used in MLlib), and introduces high-level
abstractions to simplify feature extraction / exploration and ML algorithm
development.  These abstractions leverage the kernels in MLlib when
possible, and also introduce additional kernels.  This work also includes a
library written against the MLI.  The MLI is currently written against
Spark, but is designed to be platform independent, so that code written
against MLI could be run on different engines (e.g., Hadoop, GraphX, etc.).
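
As a rough illustration of what a "common interface" buys you (this is a
minimal sketch, not the actual MLI API; every name below is hypothetical),
code written against a shared train/predict contract works unchanged for
any algorithm that implements it:

```python
from abc import ABC, abstractmethod
from typing import List, Sequence

Vector = Sequence[float]


class Model(ABC):
    """Hypothetical trained model: maps a feature vector to a prediction."""

    @abstractmethod
    def predict(self, x: Vector) -> float:
        ...


class Algorithm(ABC):
    """Hypothetical common interface: every algorithm trains the same way."""

    @abstractmethod
    def train(self, xs: List[Vector], ys: List[float]) -> Model:
        ...


class ConstantModel(Model):
    """Model that ignores its input and returns a fixed value."""

    def __init__(self, value: float) -> None:
        self.value = value

    def predict(self, x: Vector) -> float:
        return self.value


class MeanRegressor(Algorithm):
    """Toy algorithm: always predicts the mean of the training labels."""

    def train(self, xs: List[Vector], ys: List[float]) -> Model:
        return ConstantModel(sum(ys) / len(ys))


def mean_squared_error(algorithm: Algorithm, xs, ys) -> float:
    """Works with any Algorithm, because it only uses the shared interface."""
    model = algorithm.train(xs, ys)
    return sum((model.predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data, made up for this example.
xs = [[0.0], [1.0], [2.0]]
ys = [1.0, 3.0, 5.0]
training_error = mean_squared_error(MeanRegressor(), xs, ys)
print(training_error)
```

Swapping in a different Algorithm subclass would not require any change to
mean_squared_error, which is the property the shared interface is meant to
provide.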

3) ML Optimizer: This piece automates the task of model selection.  Model
selection can be viewed as a search problem over the feature extractors and
algorithms included in the MLI library, and the optimizer relies in part on
efficient cross validation.  This work is under active development but is
at an earlier stage than MLlib and MLI.
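
To make the "search problem" framing concrete, here is a minimal sketch,
assuming a plain exhaustive search scored by k-fold cross validation.  It
is not the ML Optimizer's actual strategy, and all function names and the
toy data are hypothetical:

```python
from itertools import product
from statistics import mean


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def fit_mean(zs, ys):
    """Learner 1: constant predictor (mean of the training labels)."""
    y_bar = mean(ys)
    return lambda z: y_bar


def fit_linear(zs, ys):
    """Learner 2: one-feature least squares, y = slope * z + intercept."""
    z_bar, y_bar = mean(zs), mean(ys)
    var = sum((z - z_bar) ** 2 for z in zs)
    cov = sum((z - z_bar) * (y - y_bar) for z, y in zip(zs, ys))
    slope = cov / var if var > 0 else 0.0
    return lambda z: slope * z + (y_bar - slope * z_bar)


def cross_val_error(featurize, fit, xs, ys, k=3):
    """Mean squared error averaged over k held-out folds."""
    errors = []
    for train, test in k_fold_indices(len(xs), k):
        predict = fit([featurize(xs[i]) for i in train],
                      [ys[i] for i in train])
        errors.append(mean((predict(featurize(xs[i])) - ys[i]) ** 2
                           for i in test))
    return mean(errors)


def select_model(featurizers, learners, xs, ys, k=3):
    """Try every (featurizer, learner) pair; return the lowest-error one."""
    scored = [(cross_val_error(f, fit, xs, ys, k), f_name, l_name)
              for (f_name, f), (l_name, fit)
              in product(featurizers.items(), learners.items())]
    return min(scored)


# Toy data, made up for this example: y = x^2.
xs = [float(x) for x in range(-4, 5)]
ys = [x * x for x in xs]
featurizers = {"identity": lambda x: x, "square": lambda x: x * x}
learners = {"mean": fit_mean, "linear": fit_linear}
best = select_model(featurizers, learners, xs, ys)
print(best)  # (cross-validated error, featurizer name, learner name)
```

Exhaustive search is only practical for a handful of candidates; it stands
in here for whatever search procedure is actually used.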

(note: MLlib will be included with the Spark codebase, while the MLI and ML
Optimizer will live in separate repositories.)

As far as I can tell (though please correct me if I've misunderstood) your
main goals include:

i) "consistency in the API"
ii) "some level of abstraction but to keep things as simple as possible"
iii) "execute models on Spark ... while providing workflows for pipelining
transformations, feature extraction, testing and cross-validation, and data

The MLI (and to some extent the ML Optimizer) is very much in line with
these goals, and it would be great if you were interested in contributing
to it.  MLI is in a private repository right now, but we'll make it public
soon, and Evan Sparks or I will let you know when we do.

Thanks again for getting in touch with us!


On Wed, Jul 24, 2013 at 11:47 AM, Reynold Xin <> wrote:

> On Wed, Jul 24, 2013 at 1:46 AM, Nick Pentreath <> wrote:
> >
> > I also found Breeze to be very nice to work with and like the DSL - hence
> > my question about why not use that? (Especially now that Breeze is
> > actually just breeze-math and breeze-viz).
> >
> Matei addressed this from a higher level. I want to provide a little bit
> more context. A common property of many high-level Scala DSL libraries
> is that simple operators tend to have high virtual function overheads and
> also create a lot of temporary objects. And because the level of
> abstraction is so high, it is fairly hard to debug / optimize performance.
> --
> Reynold Xin, AMPLab, UC Berkeley
