spark-user mailing list archives

From Nick Pentreath <>
Subject Fwd: [Scikit-learn-general] Spark+sklearn sprint outcome ?
Date Tue, 04 Mar 2014 15:05:02 GMT
Thought that Spark users may be interested in the outcome of the Spark /
scikit-learn sprint that happened last month just after Strata...

---------- Forwarded message ----------
From: Olivier Grisel <>
Date: Fri, Feb 21, 2014 at 6:30 PM
Subject: Re: [Scikit-learn-general] Spark+sklearn sprint outcome ?
To: scikit-learn-general <>

2014-02-21 16:06 GMT+01:00 Eustache DIEMERT <>:
> Hi there,
> Could someone that attended the sprint send a rough summary ?
> I'd be particularly interested about the tested approaches, those that
> didn't work, those that seem promising and what the next steps could be

We started with a general discussion of PySpark. It naturally features
a Python wrapper to the mllib Scala distributed machine learning
library [1], which is optimized to work on Spark. However, Python
users might still want to leverage existing numpy / scipy tools for
some tasks.
The main difficulty in using numpy-aware tools efficiently is that
Spark presents the data to the workers as an iterator over a large,
possibly cluster-partitioned collection of elements called an RDD. If
used naively, one would load individual rows (1D numpy arrays) as
elements of an RDD to represent the content of a 2D data matrix. This
is not efficient, both because of the communication overhead between
the Scala and Python workers and because it prevents doing efficient
BLAS operations that involve several rows at a time, such as DGEMM
calls.

So this first issue was tackled by writing a block_rdd helper function
[2] that concatenates a bunch of rows (e.g. 1D numpy arrays or lists
of Python dicts) into chunked 2D numpy arrays or pandas DataFrames.
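The actual helper lives in [2]; as a rough, hypothetical sketch of the blocking idea in plain numpy (block_iterator is an invented name here, not the real API):

```python
import numpy as np

def block_iterator(rows, block_size=1000):
    """Concatenate consecutive 1D rows into 2D blocks so that later
    stages can use whole-matrix BLAS operations instead of row-by-row
    Python calls. Hypothetical sketch, not the actual block_rdd code."""
    block = []
    for row in rows:
        block.append(row)
        if len(block) == block_size:
            yield np.vstack(block)
            block = []
    if block:  # flush the final, possibly smaller block
        yield np.vstack(block)

# On Spark this kind of function would be applied per partition, e.g.:
# blocked = rdd.mapPartitions(block_iterator)
```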

This makes it possible to train linear models incrementally in a more
efficient way, as done in [3]. Model averaging is done via a reduction.
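The real training code is in [3]; the model-averaging pattern can be sketched in plain numpy as follows, with each block fit independently and the per-block weight vectors combined by a reduction (fit_block and average_models are invented names for illustration):

```python
import numpy as np
from functools import reduce

def fit_block(block):
    """Fit a least-squares linear model on one 2D block, where the last
    column is the target and the rest are features."""
    X, y = block[:, :-1], block[:, -1]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def average_models(blocks):
    """Average the per-block weight vectors. On Spark this would be
    something like rdd.map(fit_block).reduce(np.add) / rdd.count()."""
    weights = [fit_block(b) for b in blocks]
    return reduce(np.add, weights) / len(weights)
```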

We also discussed how we could make it easier to plot the distribution
of data stored in an RDD, and came up with the idea of computing
histograms on the Spark side while exposing them with the same API as
the numpy.histogram function [4].
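The idea is straightforward to sketch: with fixed, shared bin edges, per-block histogram counts can simply be summed, and the result matches numpy.histogram on the full data. A minimal local sketch (distributed_histogram is an invented name, not the API from [4]):

```python
import numpy as np

def distributed_histogram(blocks, bins=10, value_range=(0.0, 1.0)):
    """Compute a histogram over many data blocks by summing per-block
    counts against shared bin edges, returning (counts, edges) like
    numpy.histogram. Hypothetical sketch of the idea in [4]."""
    edges = np.linspace(value_range[0], value_range[1], bins + 1)
    counts = np.zeros(bins, dtype=np.int64)
    for block in blocks:
        c, _ = np.histogram(block, bins=edges)
        counts += c  # counts are additive across blocks
    return counts, edges
```

On Spark the per-block step would run in a `map` and the summation in a `reduce`; summing integer count arrays is associative, which is what makes the reduction safe.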

Have a look at the tests [5] for basic usage examples of all of the above.

There is also some high level discussion of the scope of the project in [6].


Olivier

