mahout-user mailing list archives

From nikos <>
Subject Re: Mahout K-means has different behavior based on the number of mapping tasks
Date Thu, 27 Sep 2012 10:17:09 GMT
Thank you for the answers,
so how could we check if there is a problem in the reducer? And if,
indeed, there is, could that also explain why there are users that experience
slow executions of K-means (
Also, I should mention that for a larger k (near 100), again with the same
dataset, the same parameters and the same initial centroids, K-means
converges in two iterations when it runs on one mapper, but when I split
the dataset across two mappers it never converges and uses up all the
allowed iterations before it finishes (even if I set -x 100).
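
One way to probe the reducer question is to check that the accumulation Jeff describes in the quoted message below (add s0, s1 and s2 per clusterId, then compute the center and radius) gives the same result no matter how the input is split. The following is a minimal sketch in Python with NumPy, not Mahout's code: the function names `map_split` and `reduce_stats`, the squared Euclidean distance, and the standard-deviation-style radius are all assumptions made for illustration.

```python
import numpy as np

def map_split(points, centers):
    """Mapper sketch: assign each point of one split to its nearest center
    and return {cluster_id: (s0, s1, s2)} = (count, sum, sum of squares)."""
    stats = {}
    for p in points:
        # Squared Euclidean distance is an assumption; any fixed measure works.
        cid = int(np.argmin(((centers - p) ** 2).sum(axis=1)))
        s0, s1, s2 = stats.get(cid, (0, 0.0, 0.0))
        stats[cid] = (s0 + 1, s1 + p, s2 + p * p)
    return stats

def reduce_stats(per_mapper):
    """Reducer sketch: add s0, s1, s2 for equal cluster ids, then close
    each cluster by computing a center and a per-dimension radius."""
    total = {}
    for stats in per_mapper:
        for cid, (s0, s1, s2) in stats.items():
            t0, t1, t2 = total.get(cid, (0, 0.0, 0.0))
            total[cid] = (t0 + s0, t1 + s1, t2 + s2)
    closed = {}
    for cid, (s0, s1, s2) in total.items():
        center = s1 / s0
        radius = np.sqrt(np.maximum(s2 / s0 - center ** 2, 0.0))
        closed[cid] = (center, radius)
    return closed

# Synthetic data (hypothetical, not the dataset from this thread).
rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 3))
centers = points[:5].copy()

one_mapper = reduce_stats([map_split(points, centers)])
two_mappers = reduce_stats([map_split(points[:500], centers),
                            map_split(points[500:], centers)])

same = all(np.allclose(one_mapper[c][0], two_mappers[c][0]) and
           np.allclose(one_mapper[c][1], two_mappers[c][1])
           for c in one_mapper)
print("split-independent:", same)
```

Because the statistics are plain sums, the first-iteration centers should agree up to floating-point rounding however the points are partitioned. If the real job disagrees already after the first iteration, comparing the per-cluster s0 counts from the one-mapper and two-mapper runs would be one place to start.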

On 09/26/12 23:51, Jeff Eastman wrote:
> Very odd indeed. Each mapper will start with the same set of clusters 
> and assign points to clusters (clusters observe the points) based upon 
> the cluster centers (identical) and the chosen distance measure (also 
> identical). At the end of the map step, each mapper sends its trained 
> clusters (with observation statistics s0, s1 & s2) to the reducer(s) 
> keyed by clusterId.
> In the reducer, the trained clusters are accumulated by taking the 
> first and observing all the subsequent clusters (with the same 
> clusterId) with it. This is done by adding the s0, s1 and s2 values 
> from each observed cluster.
> Finally, each cluster is closed and a new center & radius is 
> calculated before it is output to begin the next iteration. If there 
> is a problem in the implementation, it would be in the reducer where 
> the accumulations occur.
> On 9/26/12 3:16 PM, paritosh ranjan wrote:
>> Each input split ( containing vectors in this case ) goes to a different
>> mapper task, and the clusters (models) are trained using the vectors
>> present in each mapper task, and the models are updated in the reducer.
>> This process is repeated until convergence or the maximum number of
>> iterations. Because different vectors went to different mapper tasks
>> when two mapper tasks were used, it took more iterations to converge,
>> and the results after the first iteration were also different.
>> Look into CIMapper and CIReducer classes for more/better explanation.
>> On Thu, Sep 27, 2012 at 12:03 AM, paritosh ranjan wrote:
>>> And same set of centroids were used for both executions?
>>> On Wed, Sep 26, 2012 at 11:22 PM, nikos <> wrote:
>>>> The centroids have been selected in a previous execution of Mahout
>>>> K-means via randomSeed generator.
>>>> On 09/26/2012 08:43 PM, paritosh ranjan wrote:
>>>>> By saying "Using a pre-selected set of initial centroids" do you mean
>>>>> that the initial centroids were the same in both executions?
>>>>> In other words, how are you choosing your initial centroids?
>>>>> On Wed, Sep 26, 2012 at 10:40 PM, nikos <> wrote:
>>>>>> I experience a strange situation when running Mahout K-means: using a
>>>>>> pre-selected set of initial centroids, I run K-means on a SequenceFile
>>>>>> generated by lucene.vector. The run is for testing purposes, so the
>>>>>> file is small (around 10 MB, ~10,000 vectors).
>>>>>> When K-means is executed with a single mapper (the default, given the
>>>>>> Hadoop split size, which in my cluster is 128 MB), it reaches a given
>>>>>> clustering result in 2 iterations (Case A). However, I wanted to test
>>>>>> whether there would be any improvement/deterioration in the algorithm's
>>>>>> execution speed by firing more mapping tasks (the Hadoop cluster has
>>>>>> 6 nodes in total). I therefore set the -Dmapred.max.split.size
>>>>>> parameter to 5242880 bytes, in order to make Mahout fire 2 mapping
>>>>>> tasks (Case B). I did succeed in starting two mappers, but the strange
>>>>>> thing was that the job finished after 5 iterations instead of 2, and
>>>>>> that even at the first assignment of points to clusters, the mappers
>>>>>> made different choices compared to the single-map execution. What I
>>>>>> mean is that after close inspection of the clusterDump for the first
>>>>>> iteration in both cases, I found that in Case B some points were not
>>>>>> assigned to their closest cluster.
>>>>>> Could this behavior be justified by the existing K-means Mahout
>>>>>> implementation?
>>>>>> Thanks in advance.
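
The split arithmetic in the original message can be sketched as follows. This assumes the rule commonly used by Hadoop's FileInputFormat, where the split size is max(minSize, min(maxSize, blockSize)) and the final split may run up to 10% over that size. The function name `estimate_splits` and its defaults are hypothetical, and the exact file size (10 MB) is taken as a round figure from "around 10MB".

```python
def estimate_splits(file_bytes, block_bytes, max_split_bytes,
                    min_split_bytes=1, slop=1.1):
    """Estimate the number of input splits (and hence map tasks) for one file.

    Assumption: split size = max(min, min(max, block)), and a new split is
    cut only while the remaining bytes exceed `slop` times the split size.
    """
    split = max(min_split_bytes, min(max_split_bytes, block_bytes))
    count, remaining = 0, file_bytes
    while remaining / split > slop:
        count += 1
        remaining -= split
    if remaining > 0:
        count += 1  # the tail becomes the last split
    return count

MB = 1024 * 1024
NO_LIMIT = 2**63 - 1  # stand-in for "no max split size configured"

case_a = estimate_splits(10 * MB, 128 * MB, NO_LIMIT)   # Case A
case_b = estimate_splits(10 * MB, 128 * MB, 5242880)    # Case B
print(case_a, case_b)
```

Under these assumptions, a 128 MB block size leaves the file in one split, while a 5242880-byte (5 MB) cap cuts it into two, which matches the one-mapper and two-mapper runs described above.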
