mahout-user mailing list archives

From Reinis Vicups <mah...@orbit-x.de>
Subject Mahout 1.0: parallelism/number tasks during SimilarityAnalysis.rowSimilarity
Date Mon, 13 Oct 2014 15:56:13 GMT
Hi,

I am currently testing SimilarityAnalysis.rowSimilarity and am wondering 
how I could increase the number of tasks used for the distributed 
shuffle.

What I currently observe is that SimilarityAnalysis takes almost 
20 minutes on my dataset in this one stage alone:

combineByKey at ABt.scala:126

When I view the details for the stage, I see that only one task is 
spawned, running on one node.

I have my own implementation of SimilarityAnalysis, and by tuning the 
number of tasks I have achieved HUGE performance gains.

Since I couldn't find a way to pass the number of tasks to the shuffle 
operations directly, I have set the following in the Spark config:

configuration = new SparkConf().setAppName(jobConfig.jobName)
         .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
         .set("spark.kryo.referenceTracking", "false")
         .set("spark.kryoserializer.buffer.mb", "200")
         .set("spark.default.parallelism", "400") // <- supposed to set the default parallelism to some high number
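
To illustrate what I mean by passing the number of tasks to a shuffle operation directly: in plain Spark, pair-RDD operations such as combineByKey accept an explicit partition count. The following is only a minimal sketch under assumptions (a local master, made-up data, a hypothetical object name); it is not the code in ABt.scala and says nothing about how Mahout exposes this.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark 1.x

// Hypothetical standalone example; names and data are illustrative only.
object ShuffleParallelismSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shuffle-parallelism-sketch") // hypothetical app name
      .setMaster("local[4]")                    // assumption: local test run
    val sc = new SparkContext(conf)

    // Made-up (key, value) pairs standing in for real data.
    val pairs = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)))

    val numTasks = 400 // illustrative value, same as in my config above

    // The last argument sets the number of partitions of the shuffled RDD,
    // and therefore the number of tasks in the stage that reads the shuffle.
    val sums = pairs.combineByKey(
      (v: Double) => v,                    // createCombiner
      (acc: Double, v: Double) => acc + v, // mergeValue
      (a: Double, b: Double) => a + b,     // mergeCombiners
      numTasks)

    // Number of partitions of the shuffled RDD.
    println(sums.partitions.length)

    sc.stop()
  }
}
```

This is roughly what I do in my own implementation; what I am missing is an equivalent knob in SimilarityAnalysis.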

Thank you for your help
reinis

