mahout-user mailing list archives

From paritosh ranjan
Subject Re: Mahout K-means has different behavior based on the number of mapping tasks
Date Wed, 26 Sep 2012 19:16:10 GMT
Each input split (containing vectors in this case) goes to a different
mapper task. The clusters (models) are trained using only the vectors
present in each mapper task, and the models are then updated in the reducer.
This process is repeated until convergence or until the maximum number of
iterations is reached. Since the vectors were divided between two mapper
tasks in your second run, it took more iterations to converge, and the
results after the first iteration were also different.

Look into the CIMapper and CIReducer classes for a more detailed explanation.
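
The flow described above can be sketched in a few lines of Python. This is
only an illustration under simplifying assumptions, not Mahout's CIMapper or
CIReducer code: the function names (nearest, map_split, reduce_partials,
iterate) and the sample data are hypothetical, distances are squared
Euclidean, and each "mapper" just accumulates a per-cluster vector sum and
count for its own split before a single "reducer" merges them.

```python
# Illustrative sketch only: NOT Mahout's CIMapper/CIReducer implementation.
# All names and data below are hypothetical.

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def map_split(split, centroids):
    """Mapper: for one input split, accumulate per-cluster (vector sum, count)."""
    partial = {}
    for point in split:
        k = nearest(point, centroids)
        total, count = partial.get(k, ([0.0] * len(point), 0))
        partial[k] = ([t + p for t, p in zip(total, point)], count + 1)
    return partial

def reduce_partials(partials, centroids):
    """Reducer: merge the mappers' partial sums and recompute each centroid."""
    merged = {}
    for partial in partials:
        for k, (total, count) in partial.items():
            m_total, m_count = merged.get(k, ([0.0] * len(total), 0))
            merged[k] = ([a + b for a, b in zip(m_total, total)], m_count + count)
    updated = [list(c) for c in centroids]  # empty clusters keep their centroid
    for k, (total, count) in merged.items():
        updated[k] = [t / count for t in total]
    return updated

def iterate(splits, centroids):
    """One iteration: one map task per split, then a single reduce step."""
    return reduce_partials([map_split(s, centroids) for s in splits], centroids)

points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
initial = [[0.0, 1.0], [9.0, 9.0]]

# Same data and same initial centroids, processed as one split and as two.
one_mapper = iterate([points], initial)
two_mappers = iterate([[points[0], points[2]], [points[1], points[3]]], initial)
```

Running the sketch with one split and with two, from the same initial
centroids, lets you compare the first-iteration centroids directly, which is
the same comparison as inspecting the clusterDump output of both runs.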

On Thu, Sep 27, 2012 at 12:03 AM, paritosh ranjan wrote:

> And was the same set of centroids used for both executions?
> On Wed, Sep 26, 2012 at 11:22 PM, nikos wrote:
>> The centroids were selected in a previous execution of Mahout
>> K-means via the randomSeed generator.
>> On 09/26/2012 08:43 PM, paritosh ranjan wrote:
>>> By saying "Using a pre-selected set of initial centroids", do you mean
>>> that the initial centroids were the same in both executions?
>>> In other words, how are you choosing your initial centroids?
>>> On Wed, Sep 26, 2012 at 10:40 PM, nikos wrote:
>>>> I am seeing strange behavior when running Mahout K-means. Using a
>>>> pre-selected set of initial centroids, I run K-means on a SequenceFile
>>>> generated by lucene.vector. The run is for testing purposes, so the file
>>>> is small (around 10 MB, roughly 10,000 vectors).
>>>> When K-means is executed with a single mapper (the default, given that
>>>> the Hadoop split size in my cluster is 128 MB), it reaches a given
>>>> clustering result in 2 iterations (Case A). However, I wanted to test
>>>> whether firing more mapping tasks would improve or degrade the
>>>> algorithm's execution speed (the Hadoop cluster has 6 nodes in total).
>>>> I therefore set the -Dmapred.max.split.size parameter to 5242880 bytes,
>>>> in order to make Mahout fire 2 mapping tasks (Case B). I did succeed in
>>>> starting two mappers, but the strange thing was that the job finished
>>>> after 5 iterations instead of 2, and that even at the first assignment
>>>> of points to clusters, the mappers made different choices compared to
>>>> the single-mapper execution. What I mean is that, after close inspection
>>>> of the clusterDump output for the first iteration in both cases, I found
>>>> that in Case B some points were not assigned to their closest cluster.
>>>> Could this behavior be justified by the existing Mahout K-means
>>>> implementation?
>>>> Thanks in advance.
