spark-user mailing list archives

From Jacques Heunis
Subject Re: Is Apache Spark less accurate than Scikit Learn?
Date Thu, 22 Jan 2015 06:44:25 GMT
Ah, I see, thanks!
I was just confused because, given the same configuration, I would have
thought that Spark and Scikit would give more similar results, but I guess
this is simply not the case (as in your example, to get Spark to give an
MSE sufficiently close to Scikit's you have to give it a significantly
larger step size and iteration count).
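
To see how much the step size and iteration count alone can move the MSE, here is a minimal sketch in plain Python. It is not either of the original scripts and uses neither library: it assumes noise-free data for z=x+y, a hand-written full-batch gradient descent, and a constant step size (all simplifications of mine). The two settings are a deliberately small step of 0.01 with 100 iterations (hypothetical values) and the step of 0.5 with 1200 iterations quoted below.

```python
import random


def make_data(n=1000, seed=0):
    """Noise-free samples of z = x + y, with x and y uniform in [0, 1)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        data.append((x, y, x + y))
    return data


def mse(data, w1, w2, b):
    """Mean squared error of the model w1*x + w2*y + b on the data."""
    return sum((w1 * x + w2 * y + b - z) ** 2 for x, y, z in data) / len(data)


def batch_gd(data, step, iterations):
    """Full-batch gradient descent on half the MSE, constant step, zero start."""
    w1 = w2 = b = 0.0
    n = len(data)
    for _ in range(iterations):
        g1 = g2 = gb = 0.0
        for x, y, z in data:
            err = w1 * x + w2 * y + b - z
            g1 += err * x
            g2 += err * y
            gb += err
        # One parameter update per full pass over the data.
        w1 -= step * g1 / n
        w2 -= step * g2 / n
        b -= step * gb / n
    return w1, w2, b


data = make_data()
mse_small = mse(data, *batch_gd(data, step=0.01, iterations=100))
mse_large = mse(data, *batch_gd(data, step=0.5, iterations=1200))
print("step=0.01, iterations=100 :", mse_small)
print("step=0.5,  iterations=1200:", mse_large)
```

On this toy problem the first setting stops well short of the exact solution (coefficients 1, intercept 0), while the second gets very close to it, so the gap in MSE here comes purely from tuning and not from the data.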

Would that then be a result of MLlib and Scikit differing slightly in their
exact implementation of the optimizer? Or rather a case of (as you say)
Scikit being a far more mature system (and therefore MLlib would 'get
better' over time)? Surely it is far from ideal that to get the same
results you need more iterations (i.e. more computation), or do you think
that this is simply coincidence and that given a different model/dataset it
may be the other way around?
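
One implementation detail that could plausibly matter is what a single "iteration" does. As far as I understand, scikit-learn's SGDRegressor updates the coefficients after every individual sample, while MLlib's SGD takes one step per iteration using a gradient averaged over the sampled batch. I have not checked this against the original scripts, so treat it as an assumption. The sketch below is plain Python, not either library, with hypothetical function names and a constant step size; it gives both update styles the same step and the same number of passes over the data.

```python
import random


def make_data(n=1000, seed=0):
    """Noise-free samples of z = x + y, with x and y uniform in [0, 1)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        data.append((x, y, x + y))
    return data


def mse(data, w1, w2, b):
    """Mean squared error of the model w1*x + w2*y + b on the data."""
    return sum((w1 * x + w2 * y + b - z) ** 2 for x, y, z in data) / len(data)


def per_sample_sgd(data, step, passes):
    """One parameter update per sample, so len(data) updates per pass."""
    w1 = w2 = b = 0.0
    for _ in range(passes):
        for x, y, z in data:
            err = w1 * x + w2 * y + b - z
            w1 -= step * err * x
            w2 -= step * err * y
            b -= step * err
    return w1, w2, b


def full_batch_gd(data, step, iterations):
    """One parameter update per iteration, from the gradient averaged over all samples."""
    w1 = w2 = b = 0.0
    n = len(data)
    for _ in range(iterations):
        g1 = g2 = gb = 0.0
        for x, y, z in data:
            err = w1 * x + w2 * y + b - z
            g1 += err * x
            g2 += err * y
            gb += err
        w1 -= step * g1 / n
        w2 -= step * g2 / n
        b -= step * gb / n
    return w1, w2, b


data = make_data()
# Same step size and same number of passes over the data for both variants.
mse_per_sample = mse(data, *per_sample_sgd(data, step=0.1, passes=5))
mse_full_batch = mse(data, *full_batch_gd(data, step=0.1, iterations=5))
print("per-sample updates, 5 passes:", mse_per_sample)
print("full-batch updates, 5 passes:", mse_full_batch)
```

With 1000 points, five passes mean 5000 small updates in the per-sample variant but only 5 updates in the full-batch one, so "the same configuration" can hide very different amounts of optimisation for roughly the same amount of arithmetic.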

I ask because I encountered this situation on other, larger datasets, so
this is not an isolated case (though being the simplest example I could
think of, I would imagine that it's somewhat indicative of general behaviour).

On Thu, Jan 22, 2015 at 1:57 AM, Robin East wrote:

> I don’t get those results. I get:
> spark           0.14
> scikit-learn    0.85
> The scikit-learn MSE is due to the very low eta0 setting. Tweak that to
> 0.1 and push iterations to 400 and you get an MSE ~= 0. Of course the
> coefficients are both ~1 and the intercept ~0. Similarly, if you change the
> MLlib step size to 0.5 and number of iterations to 1200 you again get a
> very low MSE. One of the issues with SGD is that you have to tweak these
> parameters to tune the algorithm.
> FWIW I wouldn’t see Spark MLlib as a replacement for scikit-learn. MLlib
> is nowhere near as mature as scikit-learn. However, if you have large datasets
> that won’t sensibly fit the scikit-learn in-core model, MLlib is one of the
> top choices. Similarly, if you are running proofs of concept that you are
> eventually going to scale up to production environments, then there is a
> definite argument for using MLlib at both the PoC and production stages.
> On 21 Jan 2015, at 20:39, JacquesH wrote:
> > I've recently been trying to get to know Apache Spark as a replacement for
> > Scikit Learn, however it seems to me that even in simple cases, Scikit
> > converges to an accurate model far faster than Spark does.
> > For example I generated 1000 data points for a very simple linear function
> > (z=x+y) with the following script:
> >
> >
> >
> > I then ran the following Scikit script:
> >
> >
> >
> > And then this Spark script: (with spark-submit <filename>, no other
> > arguments)
> >
> >
> >
> > Strangely though, the error given by Spark is roughly four times larger
> > than that given by Scikit (0.185 and 0.045 respectively) despite the two
> > models having a nearly identical setup (as far as I can tell).
> > I understand that this is using SGD with very few iterations and so the
> > results may differ, but I wouldn't have thought that it would be anywhere
> > near such a large difference or such a large error, especially given the
> > exceptionally simple data.
> >
> > Is there something I'm misunderstanding in Spark? Is it not correctly
> > configured? Surely I should be getting a smaller error than that?