mahout-user mailing list archives

From Chris Lu <...@atypon.com>
Subject Re: LDA on single node is much faster than 20 nodes
Date Tue, 06 Sep 2011 23:32:52 GMT
BTW: the output shows the default number of map tasks has been changed to 40.
However, only one map task is actually running.

2011-09-06 22:20:34,069 INFO org.apache.mahout.clustering.lda.LDADriver (main): LDA Iteration 1
2011-09-06 22:20:34,156 INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: 40
2011-09-06 22:20:34,157 INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 33
2011-09-06 22:20:36,381 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 1
2011-09-06 22:20:40,420 INFO org.apache.hadoop.mapred.JobClient (main): Running job: job_201109062203_0002
2011-09-06 22:20:41,422 INFO org.apache.hadoop.mapred.JobClient (main): map 0% reduce 0%
2011-09-06 22:39:52,669 INFO org.apache.hadoop.mapred.JobClient (main): map 1% reduce 0%
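A likely explanation (an assumption on my part, based on the split arithmetic of the new-API FileInputFormat that appears in the log, with Hadoop's default 64 MB block size): the 46 MB input fits in a single split, so only one map task is created no matter what mapred.map.tasks says. A minimal sketch of that arithmetic:

```shell
# Sketch, assuming Hadoop 0.20-era defaults: 64 MB block size,
# min split size = 1 byte, max split size = Long.MAX_VALUE.
# FileInputFormat picks: splitSize = max(min, min(max, blockSize))
INPUT=$((46 * 1024 * 1024))   # ~46 MB of input, as in this thread
BLOCK=$((64 * 1024 * 1024))   # default HDFS block size
MIN=1
MAX=$((1 << 62))              # stand-in for Long.MAX_VALUE
TMP=$(( MAX < BLOCK ? MAX : BLOCK ))
SPLIT=$(( MIN > TMP ? MIN : TMP ))       # 64 MB: the whole input fits
MAPS=$(( (INPUT + SPLIT - 1) / SPLIT ))  # ceil(46 MB / 64 MB) = 1
echo "map tasks: $MAPS"
```

With the max split size capped at ~2.5 MB instead, the same arithmetic yields roughly 20 splits.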


On 09/06/2011 03:57 PM, Chris Lu wrote:
> Thanks. Very helpful to me!
>
> I tried to change the setting of "mapred.map.tasks".  However, the
> number of map tasks is still just one, on one of the 20 machines.
>
> ./elastic-mapreduce --create --alive \
>    --num-instances 20 --name "LDA" \
>    --bootstrap-action 
> s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
>    --bootstrap-name "Configuring number of map tasks per job" \
>    --args "-m,mapred.map.tasks=40"
>
> Does anyone know how to configure the number of mappers?
> Again, the input size is only 46M.
>
> Chris
>
> On 09/06/2011 12:09 PM, Ted Dunning wrote:
>> Well, I think that using small instances is a disaster in general.  The
>> performance that you get from them can vary easily by an order of 
>> magnitude.
>>   My own preference for real work is either m2xl or cc14xl.  The latter
>> machines give you nearly bare metal performance and no noisy 
>> neighbors.  The
>> m2xl is typically very much underpriced on the spot market.
>>
>> Sean is right about your job being misconfigured.  The Hadoop 
>> overhead is
>> considerable and you have only given it two threads to overcome that
>> overhead.
>>
>> On Tue, Sep 6, 2011 at 6:12 PM, Sean Owen <srowen@gmail.com> wrote:
>>
>>> That's your biggest issue, certainly. Only 2 mappers are running, even
>>> though you have 20 machines available. Hadoop determines the number of
>>> mappers based on input size, and your input isn't so big that it 
>>> thinks you
>>> need 20 workers. It's launching 33 reducers, so your cluster is put 
>>> to use
>>> there. But it's no wonder you're not seeing anything like 20x 
>>> speedup in
>>> the
>>> mapper.
>>>
>>> You can of course force it to use more mappers, and that's probably 
>>> a good
>>> idea here. -Dmapred.map.tasks=20 perhaps. More mappers means more 
>>> overhead
>>> of spinning up mappers to process less data, and Hadoop's guess 
>>> indicates
>>> that it thinks it's not efficient to use 20 workers. If you know 
>>> that those
>>> other 18 are otherwise idle, my guess is you'd benefit from just 
>>> making it
>>> use 20.
>>>
>>> If this were a general large cluster where many people are taking
>>> advantage of the workers, then I'd trust Hadoop's guesses until you are
>>> sure you want to do otherwise.
>>>
>>> On Tue, Sep 6, 2011 at 7:02 PM, Chris Lu <clu@atypon.com> wrote:
>>>
>>>> Thanks for all the suggestions!
>>>>
>>>> All the inputs are the same. It takes 85 hours for 4 iterations on 20
>>>> Amazon small machines. On my local single node, it reached iteration 19
>>>> in the same 85 hours.
>>>>
>>>> Here is a section of the Amazon log output.
>>>> It covers the start of iteration 1, and between iteration 4 and 
>>>> iteration
>>>> 5.
>>>>
>>>> The number of map tasks is set to 2. Should it be larger, or tied to
>>>> the number of CPU cores?
>>>>
>>>>
>
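Following Sean's suggestion, one invocation that may work here (an assumption, not something tested on this cluster: mapred.map.tasks is only a hint to the new-API FileInputFormat, which splits purely by size, so the max split size has to be lowered as well):

```shell
# Hypothetical invocation; the input/output paths and any other LDA
# flags are placeholders, not taken from this thread.
# Capping the split size at ~2.5 MB should cut the 46 MB input into
# roughly 20 splits, i.e. one mapper per machine.
bin/mahout lda \
  -Dmapred.max.split.size=2500000 \
  -Dmapred.map.tasks=20 \
  --input ... --output ...
```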

