mahout-user mailing list archives

From: Pat Ferrel <...@occamsmachete.com>
Subject: Re: Mahout 1.0: parallelism/number tasks during SimilarityAnalysis.rowSimilarity
Date: Mon, 13 Oct 2014 16:53:37 GMT
There is a possibility that we are doing something with partitioning that interferes, but I
think Ted’s point is that Spark should do the right thing in most cases unless we interfere.
These values are meant for tuning to the exact job you are running, so it may not be appropriate
for us to hard-code them. We could allow the CLI to set them, as we do with -sem, if needed.
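As a sketch of what that could look like (the --defaultParallelism flag and the helper below are hypothetical, not an existing Mahout option):

    import org.apache.spark.SparkConf

    // Hypothetical sketch: if the driver exposed a flag such as --defaultParallelism,
    // forwarding it into the SparkConf could look like this. Not an existing Mahout option.
    def withParallelism(conf: SparkConf, cliValue: Option[String]): SparkConf =
      cliValue match {
        case Some(n) => conf.set("spark.default.parallelism", n) // user-tuned per job
        case None    => conf                                     // fall back to Spark's defaults
      }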

Let’s see what Dmitriy thinks about why only one task is being created.

On Oct 13, 2014, at 9:32 AM, Reinis Vicups <mahout@orbit-x.de> wrote:

Hi,

> Do you think that simply increasing this parameter is a safe and sane thing
> to do?

Why would it be unsafe?

In my own implementation I am using 400 tasks on my 4-node, 2-CPU cluster, and the execution
time of the largest shuffle stage has dropped by roughly a factor of 10.
I still have a number of test values from the time when I used the "old" RowSimilarityJob, and
with some exceptions (I guess due to randomized sparsification) my own row similarity
implementation still produces approximately the same values.
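
For reference, Spark's shuffle operations also take an explicit partition count, so the number of tasks can be tuned per shuffle rather than globally. A minimal sketch in plain Spark (the input path and app name are made up for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    // Plain Spark sketch, not Mahout code; the input path is hypothetical.
    val sc = new SparkContext(new SparkConf().setAppName("partitioning-sketch"))
    val pairs = sc.textFile("hdfs:///tmp/input").map(line => (line, 1))
    // reduceByKey accepts an optional numPartitions argument: this shuffle runs
    // with 400 reduce tasks, independent of spark.default.parallelism.
    val counts = pairs.reduceByKey(_ + _, 400)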

reinis

On 13.10.2014 18:06, Ted Dunning wrote:
> On Mon, Oct 13, 2014 at 11:56 AM, Reinis Vicups <mahout@orbit-x.de> wrote:
> 
>> I have my own implementation of SimilarityAnalysis, and by tuning the number
>> of tasks I have achieved HUGE performance gains.
>> 
>> Since I couldn't find how to pass the number of tasks to shuffle
>> operations directly, I have set the following in the Spark config:
>> 
>> configuration = new SparkConf().setAppName(jobConfig.jobName)
>>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>>   .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
>>   .set("spark.kryo.referenceTracking", "false")
>>   .set("spark.kryoserializer.buffer.mb", "200")
>>   .set("spark.default.parallelism", "400") // <- this line sets default parallelism to a high number
>> 
>> Thank you for your help
>> 
> Thank you for YOUR help!
> 
> Do you think that simply increasing this parameter is a safe and sane thing
> to do?
> 


