mahout-user mailing list archives

From Jeff Eastman
Subject Re: Mahout K-means has different behavior based on the number of mapping tasks
Date Wed, 26 Sep 2012 20:51:09 GMT
Very odd indeed. Each mapper will start with the same set of clusters 
and assign points to clusters (clusters observe the points) based upon 
the cluster centers (identical) and the chosen distance measure (also 
identical). At the end of the map step, each mapper sends its trained 
clusters (with observation statistics s0, s1 & s2) to the reducer(s) 
keyed by clusterId.
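
To make the map step concrete, here is a minimal sketch, not the actual Mahout code. It assumes s0 is the number of observed points, s1 the element-wise sum of the points and s2 the element-wise sum of their squares; the names MapStepSketch, Stats, nearest and mapSplit are hypothetical, and squared Euclidean distance stands in for whatever distance measure was chosen.

```java
import java.util.Arrays;

public class MapStepSketch {

    /** Per-cluster running sums: s0 = point count, s1 = sum of points, s2 = sum of squared points. */
    static final class Stats {
        double s0;
        final double[] s1;
        final double[] s2;

        Stats(int dim) {
            s1 = new double[dim];
            s2 = new double[dim];
        }

        /** Fold a single point into the running sums. */
        void observe(double[] x) {
            s0 += 1;
            for (int i = 0; i < x.length; i++) {
                s1[i] += x[i];
                s2[i] += x[i] * x[i];
            }
        }
    }

    /** Squared Euclidean distance, standing in for the chosen distance measure (an assumption). */
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the center closest to x; ties go to the lowest index. */
    static int nearest(double[][] centers, double[] x) {
        int best = 0;
        for (int k = 1; k < centers.length; k++) {
            if (distance(centers[k], x) < distance(centers[best], x)) {
                best = k;
            }
        }
        return best;
    }

    /** One map task: every point in the split is observed by its closest cluster. */
    static Stats[] mapSplit(double[][] centers, double[][] split) {
        Stats[] trained = new Stats[centers.length];
        for (int k = 0; k < centers.length; k++) {
            trained[k] = new Stats(centers[k].length);
        }
        for (double[] x : split) {
            trained[nearest(centers, x)].observe(x);
        }
        return trained;
    }

    public static void main(String[] args) {
        double[][] centers = {{0, 0}, {10, 10}};
        double[][] split = {{1, 1}, {9, 9}, {0, 2}, {11, 10}};
        Stats[] trained = mapSplit(centers, split);
        // Each entry is what the mapper would emit, keyed by its index (the clusterId).
        for (int k = 0; k < trained.length; k++) {
            System.out.println("clusterId=" + k + " s0=" + trained[k].s0
                + " s1=" + Arrays.toString(trained[k].s1)
                + " s2=" + Arrays.toString(trained[k].s2));
        }
    }
}
```

Note that the assignment of a point depends only on the centers and the distance measure, not on which split the point happens to land in.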

In the reducer, the trained clusters are accumulated by taking the first 
and observing all the subsequent clusters (with the same clusterId) with 
it. This is done by adding the s0, s1 and s2 values from each observed 
cluster.

Finally, each cluster is closed and a new center & radius are calculated 
before it is output to begin the next iteration. If there is a problem 
in the implementation, it would be in the reducer where the 
accumulations occur.
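
As a rough illustration of the reduce step and of why the number of mappers should not matter, here is a minimal sketch, again not the actual Mahout code. It assumes the center is s1/s0 and the radius is the per-dimension standard deviation sqrt(s0*s2 - s1*s1)/s0; the names ReduceStepSketch, Stats, train and reduce are hypothetical.

```java
import java.util.Arrays;

public class ReduceStepSketch {

    /** Per-cluster running sums: s0 = point count, s1 = sum of points, s2 = sum of squared points. */
    static final class Stats {
        double s0;
        final double[] s1;
        final double[] s2;

        Stats(int dim) {
            s1 = new double[dim];
            s2 = new double[dim];
        }

        /** Fold a single point into the running sums (map side). */
        void observe(double[] x) {
            s0 += 1;
            for (int i = 0; i < x.length; i++) {
                s1[i] += x[i];
                s2[i] += x[i] * x[i];
            }
        }

        /** Fold another partial cluster with the same clusterId into this one (reduce side). */
        void observe(Stats other) {
            s0 += other.s0;
            for (int i = 0; i < s1.length; i++) {
                s1[i] += other.s1[i];
                s2[i] += other.s2[i];
            }
        }

        /** New center: the mean of all observed points. */
        double[] center() {
            double[] c = new double[s1.length];
            for (int i = 0; i < c.length; i++) {
                c[i] = s1[i] / s0;
            }
            return c;
        }

        /** New radius: per-dimension standard deviation (an assumed formula). */
        double[] radius() {
            double[] r = new double[s1.length];
            for (int i = 0; i < r.length; i++) {
                r[i] = Math.sqrt(Math.max(0, s0 * s2[i] - s1[i] * s1[i])) / s0;
            }
            return r;
        }
    }

    /** Stand-in for one mapper's output for a single cluster. */
    static Stats train(double[][] points) {
        Stats s = new Stats(points[0].length);
        for (double[] x : points) {
            s.observe(x);
        }
        return s;
    }

    /** Take the first partial cluster and observe all subsequent ones with it. */
    static Stats reduce(Stats... partials) {
        Stats first = partials[0];
        for (int i = 1; i < partials.length; i++) {
            first.observe(partials[i]);
        }
        return first;
    }

    public static void main(String[] args) {
        // Case A: one mapper sees every point assigned to this cluster.
        Stats oneMapper = reduce(train(new double[][] {{1, 1}, {0, 2}, {2, 0}}));
        // Case B: the same points arrive from two mappers.
        Stats twoMappers = reduce(
            train(new double[][] {{1, 1}, {0, 2}}),
            train(new double[][] {{2, 0}}));
        System.out.println(Arrays.toString(oneMapper.center())
            + " vs " + Arrays.toString(twoMappers.center()));
        System.out.println(Arrays.toString(oneMapper.radius())
            + " vs " + Arrays.toString(twoMappers.radius()));
    }
}
```

Because the sums are simply added, the merged s0, s1 and s2 do not depend on how the points were divided among mappers, apart from floating-point rounding in the last digits. If the two cases give visibly different centers after the first iteration, something other than this arithmetic is going wrong.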

On 9/26/12 3:16 PM, paritosh ranjan wrote:
> Each input split ( containing vectors in this case ) goes to a different
> mapper task, and the clusters (models) are trained using the vectors
> present in each mapper task, and the models are updated in the reducer.
> This process is repeated until convergence or until the maximum number of
> iterations is reached. Because different vectors went to different mapper
> tasks when two mapper tasks were used, it took more iterations to converge,
> and the results after the first iteration were also different.
> Look into the CIMapper and CIReducer classes for a more detailed explanation.
> On Thu, Sep 27, 2012 at 12:03 AM, paritosh ranjan wrote:
>> And was the same set of centroids used for both executions?
>> On Wed, Sep 26, 2012 at 11:22 PM, nikos wrote:
>>> The centroids have been selected in a previous execution of Mahout
>>> K-means via randomSeed generator.
>>> On 09/26/2012 08:43 PM, paritosh ranjan wrote:
>>>> By saying "Using a pre-selected set of initial centroids", do you mean
>>>> that the initial centroids were the same in both executions?
>>>> In other words, how are you choosing your initial centroids?
>>>> On Wed, Sep 26, 2012 at 10:40 PM, nikos wrote:
>>>>> I am seeing strange behavior when running Mahout K-means. Using a
>>>>> pre-selected set of initial centroids, I run K-means on a SequenceFile
>>>>> generated by lucene.vector. The run is for testing purposes, so the
>>>>> file is small (around 10 MB, roughly 10,000 vectors).
>>>>> When K-means is executed with a single mapper (the default, given the
>>>>> Hadoop split size, which is 128 MB in my cluster), it reaches a given
>>>>> clustering result in 2 iterations (Case A). However, I wanted to test
>>>>> whether firing more mapping tasks would improve or degrade the
>>>>> algorithm's execution speed (the Hadoop cluster has 6 nodes in total).
>>>>> I therefore set the -Dmapred.max.split.size parameter to 5242880
>>>>> bytes to make Mahout fire 2 mapping tasks (Case B). I did succeed in
>>>>> starting two mappers, but the strange thing was that the job finished
>>>>> after 5 iterations instead of 2, and that even at the first assignment
>>>>> of points to clusters, the mappers made different choices compared to
>>>>> the single-mapper execution. After close inspection of the clusterDump
>>>>> output for the first iteration in both cases, I found that in Case B
>>>>> some points were not assigned to their closest cluster.
>>>>> Could this behavior be explained by the existing Mahout K-means
>>>>> implementation?
>>>>> Thanks in advance.
