spark-user mailing list archives

From Yanbo <yanboha...@gmail.com>
Subject Re: MLLib: LinearRegressionWithSGD performance
Date Mon, 24 Nov 2014 15:57:08 GMT
The metrics page shows that only two executors are working in parallel during each iteration.
You need to increase the number of tasks that run in parallel.
Some tips that may help:
Increase "spark.default.parallelism";
Use repartition(), or coalesce() with shuffle = true, to increase the number of partitions (coalesce() without a shuffle can only reduce it).
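
A rough sketch of the second tip in PySpark, in case it helps. The helper names, the 2 tasks per core multiplier (the usual rule of thumb of 2 to 3 tasks per CPU core), and the cores-per-executor value are my assumptions, not something taken from your job; adjust them to your actual settings.

```python
def target_partitions(num_executors, cores_per_executor, tasks_per_core=2):
    """Suggest a partition count as a small multiple of the total task slots.

    tasks_per_core=2 is an assumed rule of thumb, not a measured optimum.
    """
    return num_executors * cores_per_executor * tasks_per_core


def load_repartitioned(sc, path, num_partitions):
    """Load LibSVM data and spread it over num_partitions partitions.

    sc is an existing SparkContext; path is a hypothetical input location.
    """
    # Imported here so target_partitions() can be used without Spark installed.
    from pyspark.mllib.util import MLUtils

    data = MLUtils.loadLibSVMFile(sc, path)
    # repartition() always shuffles, so it can raise the partition count.
    # cache() keeps the shuffled data in memory across SGD iterations.
    return data.repartition(num_partitions).cache()
```

For example, with --num-executors 10 and an assumed one core per executor, target_partitions(10, 1) gives 20 partitions; the result of load_repartitioned(sc, path, 20) can then be passed to LinearRegressionWithSGD.train as before.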



> On Nov 22, 2014, at 3:18 AM, Sameer Tilak <sstilak@live.com> wrote:
> 
> Hi All,
> I have been using MLlib's linear regression and I have some questions regarding its performance.
> We have a cluster of 10 nodes; each node has 24 cores and 148GB memory. I am running my
> app as follows:
> 
> time spark-submit --class medslogistic.MedsLogistic --master yarn-client --executor-memory 6G --num-executors 10 /pathtomyapp/myapp.jar
> 
> I am also going to experiment with the number of executors (reducing it); maybe that will give us
> different results.
> 
> The input is an 800MB sparse file in LibSVM format. The total number of features is 150K.
> It takes approximately 70 minutes for the regression to finish. The job imposes very little
> load on CPU, memory, network, and disk. The total number of tasks is 104. The total time is divided
> fairly uniformly across these tasks. I was wondering, is it possible to reduce the
> execution time further?
> <Screen Shot 2014-11-21 at 11.09.20 AM.png>
> <Screen Shot 2014-11-21 at 10.59.42 AM.png>
> 