spark-user mailing list archives

From: JacquesH
Subject: Is Apache Spark less accurate than Scikit Learn?
Date: Wed, 21 Jan 2015 20:39:17 GMT
I've recently been trying to get to know Apache Spark as a replacement for
Scikit Learn. However, it seems to me that even in simple cases, Scikit
converges to an accurate model far faster than Spark does.
For example, I generated 1000 data points for a very simple linear function
(z=x+y) with the following script:

I then ran the following Scikit script:

And then this Spark script: (with spark-submit <filename>, no other

Strangely though, the error given by Spark is roughly four times larger
than that given by Scikit (0.185 and 0.045 respectively), despite the two
models having a nearly identical setup (as far as I can tell).
I understand that this is using SGD with very few iterations, so the
results may differ, but I wouldn't have expected such a large difference
or such a large error, especially given the exceptionally simple data.
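
To illustrate how much the iteration count alone can matter for SGD on this kind of data, here is a minimal NumPy sketch. It is not either of the scripts referred to above and uses neither Scikit Learn nor Spark; the function names (`make_data`, `sgd_linear`, `mse`), the uniform [0, 1) inputs, the step size, the absence of an intercept and the update counts are all assumptions chosen for illustration.

```python
import numpy as np


def make_data(n=1000, seed=0):
    """Generate n points (x, y) with target z = x + y (inputs assumed uniform in [0, 1))."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 2))
    z = X[:, 0] + X[:, 1]
    return X, z


def sgd_linear(X, z, n_updates, step, seed=0):
    """Plain SGD on squared error for a linear model without intercept."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for i in rng.integers(0, len(X), size=n_updates):
        # Gradient of 0.5 * (x.w - z)^2 with respect to w for one sample.
        grad = (X[i] @ w - z[i]) * X[i]
        w -= step * grad
    return w


def mse(X, z, w):
    """Mean squared error of the linear model w on (X, z)."""
    return float(np.mean((X @ w - z) ** 2))


X, z = make_data()
w_few = sgd_linear(X, z, n_updates=20, step=0.05)
w_many = sgd_linear(X, z, n_updates=20000, step=0.05)

print("few updates:  weights", w_few, "MSE", mse(X, z, w_few))
print("many updates: weights", w_many, "MSE", mse(X, z, w_many))
```

With the same data and step size, the run with few updates stops well short of the true weights (1, 1), while the longer run gets close to them. Differences in step size, step-size schedule or mini-batch sampling between two SGD implementations could plausibly produce a similar gap at a fixed iteration count.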

Is there something I'm misunderstanding in Spark? Is it not correctly
configured? Surely I should be getting a smaller error than that?
