mahout-user mailing list archives

From nikos <nkitm...@csd.auth.gr>
Subject Mahout K-means has different behavior based on the number of mapping tasks
Date Wed, 26 Sep 2012 17:10:06 GMT
I am seeing strange behavior when running Mahout K-means. Using a
pre-selected set of initial centroids, I run K-means on a SequenceFile
generated by lucene.vector. The run is for testing purposes, so the file
is small (around 10MB, roughly 10,000 vectors).

When K-means runs with a single mapper (the default, given that the
Hadoop split size in my cluster is 128MB), it reaches a given
clustering result in 2 iterations (Case A). I wanted to test whether
running more map tasks would speed up or slow down the algorithm (the
Hadoop cluster has 6 nodes in total), so I set -Dmapred.max.split.size
to 5242880 bytes (5MB) to make Mahout start 2 map tasks (Case B). Two
mappers did start, but the job finished after 5 iterations instead of
2, and even at the first assignment of points to clusters the mappers
made different choices from the single-mapper run. Specifically, after
closely inspecting the clusterDump output for the first iteration in
both cases, I found that in Case B some points were not assigned to
their closest cluster.
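
To make the check concrete, here is a minimal sketch of the comparison
I mean. It assumes Euclidean distance (the distance measure may differ
in a real run) and assumes the points, centroids, and recorded
assignments have already been parsed out of the clusterDump output
into plain Python lists; the parsing is not shown. The function names
and the toy data are illustrative only and are not part of Mahout.

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance).

    On an exact tie the lowest index wins, so tied points may be flagged
    even though their recorded assignment is equally valid.
    """
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def misassigned_points(points, centroids, assignments):
    """Return indices of points whose recorded cluster is not their nearest centroid."""
    return [
        idx
        for idx, (point, cluster) in enumerate(zip(points, assignments))
        if nearest_centroid(point, centroids) != cluster
    ]


# Toy data for illustration, not taken from the actual run.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 1.0), (9.0, 9.0), (2.0, 0.5)]
assignments = [0, 1, 1]  # third point is recorded in cluster 1, but cluster 0 is nearer

print(misassigned_points(points, centroids, assignments))  # prints [2]
```

Running the same check on the first-iteration output of Case A and
Case B, with the same initial centroids, is how the two runs can be
compared point by point.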

Can this behavior be explained by the current Mahout K-means
implementation?

Thanks in advance.


