mahout-user mailing list archives

From Reinis Vicups <>
Subject Re: Mahout 1.0: parallelism/number tasks during SimilarityAnalysis.rowSimilarity
Date Mon, 13 Oct 2014 17:06:46 GMT
Of course the number of partitions/tasks should be configurable. I am 
just saying that in my experiments I observed a close-to-linear 
performance increase just from increasing the number of partitions/tasks 
(which was absolutely not the case with MapReduce).

I am assuming that Spark is not "smart" enough to set optimal values for 
the parallelism. I recall reading somewhere that the default is the number 
of CPUs or 2, whichever is larger. Because of the nature of the tasks (if I 
am not mistaken, they are wrapped Akka actors), it is possible to 
efficiently execute a much higher number of tasks per CPU. They suggest 
this, but I have sometimes observed considerable performance gains when 
increasing the number of tasks to 8 or even 16 per CPU core.
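
For illustration, here is a minimal sketch of the two ways to raise the task count discussed in this thread: setting spark.default.parallelism globally, and passing a partition count to a single shuffle operation. It assumes Spark is on the classpath. The object name, the toy data, the local master, and the 8-core / 16-tasks-per-core figures are assumptions made for the example, not values taken from Mahout or from the job described above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only required on older Spark 1.x releases

// Hypothetical example object, not part of Mahout.
object ParallelismSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: 4 nodes x 2 CPUs treated as 8 cores, 16 tasks per core.
    val cores = 8
    val tasksPerCore = 16
    val numTasks = cores * tasksPerCore

    val conf = new SparkConf()
      .setAppName("parallelism-sketch")
      .setMaster("local[2]") // local master only so the sketch can run standalone
      // SparkConf.set takes String values, so the number is passed as a string.
      .set("spark.default.parallelism", numTasks.toString)

    val sc = new SparkContext(conf)
    try {
      // Toy key/value data standing in for a real matrix.
      val pairs = sc.parallelize(1 to 100000).map(i => (i % 1000, 1))

      // No explicit count: the shuffle uses spark.default.parallelism
      // when it is set and no parent RDD already has a partitioner.
      val byDefault = pairs.reduceByKey(_ + _)

      // Explicit count: applies to this shuffle only.
      val explicit = pairs.reduceByKey(_ + _, 2 * numTasks)

      println(s"default: ${byDefault.partitions.length}, " +
        s"explicit: ${explicit.partitions.length}")
    } finally {
      sc.stop()
    }
  }
}
```

The per-operation form may be the more targeted choice when only one shuffle stage dominates the run time, since the global setting affects every shuffle that does not specify its own count.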

On 13.10.2014 18:53, Pat Ferrel wrote:
> There is a possibility that we are doing something with partitioning that interferes,
> but I think Ted’s point is that Spark should do the right thing in most cases, unless we
> interfere. Those values are meant for tuning to the exact job you are doing, but it may not
> be appropriate for us to hard-code them. We could allow the CLI to set them, like we do with
> -sem, if needed.
> Let’s see what Dmitriy thinks about why only one task is being created.
> On Oct 13, 2014, at 9:32 AM, Reinis Vicups <> wrote:
> Hi,
>> Do you think that simply increasing this parameter is a safe and sane thing
>> to do?
> Why would it be unsafe?
> In my own implementation I am using 400 tasks on my 4-node, 2-CPU cluster, and the execution
> times of the largest shuffle stage have dropped by a factor of around 10.
> I have a number of test values from back when I used the "old" RowSimilarityJob and,
> with some exceptions (I guess due to randomized sparsification), I still get approximately
> the same values with my own row similarity implementation.
> reinis
> On 13.10.2014 18:06, Ted Dunning wrote:
>> On Mon, Oct 13, 2014 at 11:56 AM, Reinis Vicups <> wrote:
>>> I have my own implementation of SimilarityAnalysis, and by tuning the number of
>>> tasks I have achieved HUGE performance gains.
>>> Since I couldn't find how to pass the number of tasks to shuffle
>>> operations directly, I have set the following in the Spark config:
>>> configuration = new SparkConf().setAppName(jobConfig.jobName)
>>>          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>>>          .set("spark.kryo.registrator", ".MahoutKryoRegistrator")
>>>          .set("spark.kryo.referenceTracking", "false")
>>>          .set("spark.kryoserializer.buffer.mb", "200")
>>>          .set("spark.default.parallelism", "400") // <- this is the line supposed to set default parallelism to some high number
>>> Thank you for your help
>> Thank you for YOUR help!
>> Do you think that simply increasing this parameter is a safe and sane thing
>> to do?
