spark-user mailing list archives

From Robin East <robin.e...@xense.co.uk>
Subject Re: Is Apache Spark less accurate than Scikit Learn?
Date Wed, 21 Jan 2015 23:57:44 GMT
I don’t get those results. I get:

spark		0.14
scikit-learn	0.85

The scikit-learn MSE is high because of the very low eta0 setting. Raise eta0 to 0.1 and push
the iterations to 400 and you get an MSE ~= 0, with both coefficients ~1 and the intercept ~0,
as you would expect. Similarly, if you change the MLlib step size to 0.5 and the number of
iterations to 1200 you again get a very low MSE. One of the issues with SGD is that you have
to tweak these parameters to tune the algorithm.
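
To illustrate how much the step size and iteration count matter, here is a minimal sketch of
plain SGD for least squares on z = x + y. It is not the scikit-learn or MLlib implementation:
the constant step size, the per-sample updates, the noise-free data drawn uniformly from [0, 1)
and the helper names (make_data, sgd_fit, mse) are all assumptions made for the example, and
"epochs" here means full passes over the data, which may not match what either library calls
an iteration.

```python
import random


def make_data(n=1000, seed=0):
    """Generate n noise-free points (x, y, z) with z = x + y (assumed setup)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        data.append((x, y, x + y))
    return data


def sgd_fit(data, step, epochs, seed=0):
    """Fit z ~ wx*x + wy*y + b with per-sample SGD and a constant step size."""
    rng = random.Random(seed)
    wx = wy = b = 0.0
    rows = list(data)
    for _ in range(epochs):
        rng.shuffle(rows)  # visit the samples in a fresh random order each pass
        for x, y, z in rows:
            err = (wx * x + wy * y + b) - z  # prediction error on this sample
            # Gradient step for the loss 0.5 * err**2
            wx -= step * err * x
            wy -= step * err * y
            b -= step * err
    return wx, wy, b


def mse(data, wx, wy, b):
    """Mean squared error of the fitted model over the data."""
    return sum((wx * x + wy * y + b - z) ** 2 for x, y, z in data) / len(data)


data = make_data()

# A very small step with few passes leaves the model far from converged.
low = sgd_fit(data, step=0.0001, epochs=5)

# A larger step with many more passes should drive the error towards zero.
tuned = sgd_fit(data, step=0.1, epochs=400)

print("small step, few epochs  MSE:", mse(data, *low))
print("larger step, 400 epochs MSE:", mse(data, *tuned))
print("tuned (wx, wy, b):", tuned)
```

The same data and the same model give very different errors depending only on the step size
and the number of passes, which is why a like-for-like comparison between two SGD
implementations needs those settings matched or tuned first.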

FWIW I wouldn’t see Spark MLlib as a replacement for scikit-learn. MLlib is nowhere near as
mature as scikit-learn. However, if you have large datasets that won’t sensibly fit the
scikit-learn in-core model, MLlib is one of the top choices. Similarly, if you are running
proofs of concept that you are eventually going to scale up to production environments, then
there is a definite argument for using MLlib at both the PoC and production stages.


On 21 Jan 2015, at 20:39, JacquesH <jaaksemail@gmail.com> wrote:

> I've recently been trying to get to know Apache Spark as a replacement for
> Scikit Learn, however it seems to me that even in simple cases, Scikit
> converges to an accurate model far faster than Spark does.
> For example I generated 1000 data points for a very simple linear function
> (z=x+y) with the following script:
> 
> http://pastebin.com/ceRkh3nb
> 
> I then ran the following Scikit script:
> 
> http://pastebin.com/1aECPfvq
> 
> And then this Spark script: (with spark-submit <filename>, no other
> arguments)
> 
> http://pastebin.com/s281cuTL
> 
> Strangely though, the error given by Spark is roughly four times larger
> than that given by Scikit (0.185 and 0.045 respectively) despite the two
> models having a nearly identical setup (as far as I can tell).
> I understand that this is using SGD with very few iterations and so the
> results may differ but I wouldn't have thought that it would be anywhere
> near such a large difference or such a large error, especially given the
> exceptionally simple data.
> 
> Is there something I'm misunderstanding in Spark? Is it not correctly
> configured? Surely I should be getting a smaller error than that?
> 
> 
> 
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-Apache-Spark-less-accurate-than-Scikit-Learn-tp21301.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
> 

