spark-dev mailing list archives

From Nick Pentreath <>
Subject Re: [Scikit-learn-general] Spark-backed implementations of scikit-learn estimators
Date Wed, 27 Nov 2013 06:48:47 GMT
CC'ing Spark Dev list

I have been thinking about this for quite a while and would really love to
see this happen.

Most of my pipeline ends up in Scala/Spark these days, which I love, though that is partly because I rely on custom Hadoop input formats that are just much easier to use from Scala/Java. I still use Python a lot for data analysis and interactive work. There is some good stuff happening with Breeze in Scala and MLlib in Spark (and IScala), but the breadth just doesn't compare as yet, not to mention IPython and plotting!

There is a PR that was just merged into PySpark to allow arbitrary
serialization protocols between the Java and Python layers. I hope to use
this to let PySpark users pull data from arbitrary Hadoop InputFormats with
minimal fuss. I believe this will open the way for many (including me!) to
use PySpark directly for virtually all distributed data processing without
"needing" to use Java.

Linked to this is what I believe is huge potential to add distributed
PySpark versions of many algorithms in scikit-learn (and elsewhere). The
idea, as intimated above, would be to have estimator classes with
sklearn-compatible APIs. They may in turn use sklearn algorithms themselves;
for linear models, for example, this would be quite straightforward.
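To make the idea concrete, here is a minimal sketch of what such an estimator could look like for linear regression: fit a local model on each data partition independently, then average the fitted parameters. Plain Python lists stand in for Spark here so the sketch is self-contained; in PySpark the per-partition step would be a `mapPartitions` call and the combine step a reduce, and a real implementation would call an sklearn estimator (e.g. `SGDRegressor`) per partition instead of the closed-form fit below. The class and method names are illustrative only, not part of sklearn or PySpark.

```python
class DistributedAveragedLinearRegression:
    """Hypothetical sklearn-style estimator: fit per partition, then
    average the coefficients (simple parameter averaging)."""

    def fit(self, partitions):
        # partitions: list of (X, y) chunks, as an RDD would provide them.
        # In PySpark this loop would be rdd.mapPartitions(...).collect().
        per_partition = [self._fit_one(X, y) for X, y in partitions]
        n = len(per_partition)
        # Combine step: average the per-partition parameters.
        self.coef_ = sum(w for w, _ in per_partition) / n
        self.intercept_ = sum(b for _, b in per_partition) / n
        return self

    def predict(self, X):
        return [self.coef_ * x + self.intercept_ for x in X]

    @staticmethod
    def _fit_one(X, y):
        # Closed-form ordinary least squares for a single feature;
        # a real version would delegate to an sklearn estimator here.
        n = len(X)
        mx, my = sum(X) / n, sum(y) / n
        cov = sum((x - mx) * (yy - my) for x, yy in zip(X, y))
        var = sum((x - mx) ** 2 for x in X)
        w = cov / var
        return w, my - w * mx


# Two "partitions" drawn from y = 2x + 1; averaging recovers the line.
parts = [([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]),
         ([3.0, 4.0, 5.0], [7.0, 9.0, 11.0])]
model = DistributedAveragedLinearRegression().fit(parts)
```

Parameter averaging is only exact for models like this one; for general loss functions it is an approximation, which is exactly where a proper distributed library earns its keep.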

I'd be very happy to try to find some time to work on such a library. (I
had started one in Scala that was also going to include a Python library,
but I've just not had the time available, and between Spark MLlib appearing
and the Hadoop work above I had what I needed for my own systems.)

The main benefit I see is that sklearn already has:
- many algorithms to work with
- great, simple API
- very useful stuff like preprocessing, vectorizing and feature hashing
(very important for large scale linear models)
- obviously the nice Python ecosystem stuff like plotting, IPython
notebook, pandas, scikit-statsmodels and so on.

The easiest place to start in my opinion is to take a few of the basic
models in the PySpark examples and turn them into production-ready code
that utilises sklearn or other good libraries as much as possible.

(I think this library would live outside of both Spark and sklearn, at
least until it is clear where it should live).

I would be happy to help and provide Spark-related advice even if I cannot
find enough time to work on many algorithms. Though I do hope to find more
time toward the end of the year and early next year.


On Wed, Nov 27, 2013 at 12:42 AM, Uri Laserson <> wrote:

> Hi all,
> I was wondering whether there has been any organized effort to create
> scikit-learn estimators that are backed by Spark clusters.  Rather than
> using the PySpark API to call sklearn functions, you would instantiate
> sklearn estimators that end up calling PySpark functionality in their
> .fit() methods.
> Uri
> ......................................
> Uri Laserson
> +1 617 910 0447
> _______________________________________________
> Scikit-learn-general mailing list
