mahout-user mailing list archives

From nikos <>
Subject Re: Mahout K-means has different behavior based on the number of mapping tasks
Date Wed, 26 Sep 2012 17:52:50 GMT
The centroids were selected in a previous execution of Mahout K-means
via the randomSeed generator.

On 09/26/2012 08:43 PM, paritosh ranjan wrote:
> By saying "Using a pre-selected set of initial centroids" do you mean
> that the initial centroids were same in both executions?
> In other words, how are you choosing your initial centroids?
> On Wed, Sep 26, 2012 at 10:40 PM, nikos <> wrote:
>> I am seeing strange behavior when running Mahout K-means: Using a
>> pre-selected set of initial centroids, I run K-means on a SequenceFile
>> generated by lucene.vector. The run is for testing purposes, so the file is
>> small (around 10 MB, roughly 10,000 vectors).
>> When K-means is executed with a single mapper (the default considering the
>> Hadoop split size which in my cluster is 128MB), it reaches a given
>> clustering result in 2 iterations (Case A). However, I wanted to test if
>> there would be any improvement/deterioration in the algorithm's execution
>> speed by firing more mapping tasks (the Hadoop cluster has in total 6
>> nodes). I therefore set the -Dmapred.max.split.size parameter to 5242880
>> bytes, in order to make Mahout fire 2 mapping tasks (Case B). I indeed
>> succeeded in starting two mappers, but the strange thing was that the job
>> finished after 5 iterations instead of 2, and that even at the first
>> assignment of points to clusters, the mappers made different choices
>> compared to the single-map execution. What I mean is that after close
>> inspection of the clusterDump output for the first iteration in both cases, I
>> found that in Case B some points were not assigned to their closest cluster.
>> Could this behavior be justified by the existing K-means Mahout
>> implementation?
>> Thanks in advance.
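
The "not assigned to their closest cluster" check described above can be reproduced outside Hadoop. What follows is a minimal sketch, not Mahout code: it assumes the points, the centroids, and the per-point assignments have already been exported into plain Python dictionaries, and the names `closest_centroid` and `misassigned` are hypothetical helpers. It also assumes Euclidean distance, so the distance function should be swapped for whichever measure the K-means job was run with. The check is only meaningful against the centroids that were in effect when the assignment was made (for the first iteration, the initial centroids), not the updated centroids written at the end of that iteration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_centroid(point, centroids, distance=euclidean):
    """Return the id of the centroid nearest to `point`.

    `centroids` maps a cluster id to its centroid vector.
    """
    return min(centroids, key=lambda cid: distance(point, centroids[cid]))


def misassigned(points, assignments, centroids, distance=euclidean):
    """Return (point_id, assigned_id, closest_id) for every point whose
    recorded assignment differs from its nearest centroid.

    `points` maps point id -> vector, `assignments` maps point id -> cluster id.
    """
    bad = []
    for pid, vec in points.items():
        best = closest_centroid(vec, centroids, distance)
        if assignments[pid] != best:
            bad.append((pid, assignments[pid], best))
    return bad
```

Running `misassigned` once on the Case A assignments and once on the Case B assignments, each time with the same initial centroids, shows whether the two runs really disagree at the first assignment step or only after the centroids have been updated.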
