mahout-user mailing list archives

From Reinis Vicups <mah...@orbit-x.de>
Subject ClusterOutputPostProcessor: what is the purpose of clusterMappings
Date Thu, 08 May 2014 14:45:58 GMT
Hi,

In Mahout 0.8 I see that ClusterOutputPostProcessorMapper and
ClusterOutputPostProcessorReducer each use a
Map<Integer, Integer> *ClusterMappings =
ClusterCountReader.getClusterIDs(clusterOutputPath, conf, <true|false>).

This map allows mapping each cluster ID to an index from 0 to k-1,
where k is the number of clusters.
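
To make clear what I mean by that mapping, here is a minimal sketch of
how I understand it. This is not the Mahout code: ClusterIndexSketch,
toIndex and reverse are names made up for illustration, and the order
in which IDs receive their index is an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClusterIndexSketch {

    // Hypothetical helper (not Mahout API): gives each distinct cluster ID
    // the next free index, so k distinct IDs end up on 0..k-1.
    static Map<Integer, Integer> toIndex(List<Integer> clusterIds) {
        Map<Integer, Integer> mapping = new LinkedHashMap<>();
        for (int id : clusterIds) {
            mapping.putIfAbsent(id, mapping.size());
        }
        return mapping;
    }

    // Hypothetical helper: inverts the mapping (index -> cluster ID),
    // which is what the Reducer side would need to recover the ID.
    static Map<Integer, Integer> reverse(Map<Integer, Integer> mapping) {
        Map<Integer, Integer> reversed = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> e : mapping.entrySet()) {
            reversed.put(e.getValue(), e.getKey());
        }
        return reversed;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> forward = toIndex(Arrays.asList(345, 37636, 14));
        System.out.println(forward);          // {345=0, 37636=1, 14=2}
        System.out.println(reverse(forward)); // {0=345, 1=37636, 2=14}
    }
}
```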

What is the purpose of this mapping?

The cluster IDs themselves are ints, so mapping them to an index (and
mapping back from the index in the Reducer) seems useless to me.

Since clusterpp sets the number of reducers equal to k, I initially
thought this design was meant to ensure that each cluster goes to a
separate reducer, but this should be true even without the mapping.

With the mapping, the reducer gets keys like this: 0, 1, 2, 3,
4, 5, 6, ...
Without the mapping, the reducer gets keys like this: 345, 37636, 14,
47699, 234576, ...

But the clustered points will still be shuffled by cluster ID when
passed to the reducer.
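
For comparison, here is a small sketch of where both sets of keys would
land. It assumes clusterpp relies on Hadoop's default HashPartitioner
(I have not verified this), whose formula is
(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, and that the
keys are IntWritable, whose hashCode() is the int value itself. The
class name PartitionSketch and k = 5 are made up for illustration, and
the IDs are just the example values from above.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner.getPartition.
    // A plain int stands in for an IntWritable key, because
    // IntWritable.hashCode() returns the wrapped value.
    static int partition(int key, int numReduceTasks) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Partition number for every key, in input order.
    static List<Integer> partitions(int[] keys, int numReduceTasks) {
        List<Integer> result = new ArrayList<>();
        for (int key : keys) {
            result.add(partition(key, numReduceTasks));
        }
        return result;
    }

    public static void main(String[] args) {
        int k = 5; // assumed number of clusters and reducers
        int[] rawIds = {345, 37636, 14, 47699, 234576};
        int[] indexes = {0, 1, 2, 3, 4};

        System.out.println(partitions(rawIds, k));  // [0, 1, 4, 4, 1]
        System.out.println(partitions(indexes, k)); // [0, 1, 2, 3, 4]
    }
}
```

If that assumption holds, raw IDs can collide on one reducer (14 and
47699 both land on partition 4 here), while the indexes 0 to k-1 each
get a partition of their own.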

So what gives?

Thank you, guys, for your hints
reinis.
