mahout-user mailing list archives

From Matt Molek
Subject clusterpp is only writing directories for about half of my clusters.
Date Sat, 20 Oct 2012 19:44:37 GMT
First off, thank you everyone for your help so far. This mailing list
has been a great help getting me up and running with Mahout.

Right now, I'm clustering a set of ~3M documents into 300 clusters.
Then I'm using clusterpp to split the documents up into directories
containing the vectors belonging to each cluster. After I perform the
clustering, clusterdump shows that each cluster has between ~800 and
~200,000 documents. This isn't a great spread, but the point is that
none of the clusters are empty.

Here are my commands:

bin/mahout kmeans -i ssvd2/USigma -c initial-centroids -o pca-clusters \
  -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 \
  -k 300 -x 15 -cl -ow

bin/mahout clusterdump -i pca-clusters/clusters-11-final -o clusterdump.txt

bin/mahout clusterpp -i pca-clusters -o bottom

Since none of my clusters are empty, I would expect clusterpp to
create 300 directories in "bottom", one for each cluster. Instead,
only 147 directories are created. The other 153 outputs are just empty
part-r-* files sitting in the "bottom" directory.

I haven't found much information when searching on this issue, but
I did come across one mailing list post from a while back:

In that discussion someone said, "If that is the only thing that is
contained in the part-r-* file [it had no vectors], then the reducer
responsible to write to that part-r-* file did not receive any input
records to write to it. This happens because the program uses the
default hash partitioner which sometimes maps records belonging to
different clusters to a same reducer; thus leaving some reducers
without any input records."
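
To make the quoted explanation concrete, here is a minimal simulation sketch, not Mahout's actual code. It assumes the records are keyed by an integer cluster id whose hash is the id itself, that the job runs one reducer per cluster, and that partitioning follows the formula of Hadoop's default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks. The function names and the cluster ids below are made up for illustration.

```python
import random


def hash_partition(key_hash: int, num_reducers: int) -> int:
    """Pick a reducer index the way a default hash partitioner would."""
    # Mask to a non-negative 31-bit value, then take the remainder.
    return (key_hash & 0x7FFFFFFF) % num_reducers


def reducer_usage(cluster_ids, num_reducers):
    """Map each reducer index to the list of cluster ids routed to it."""
    usage = {}
    for cid in cluster_ids:
        usage.setdefault(hash_partition(cid, num_reducers), []).append(cid)
    return usage


# Deterministic toy case: three ids that all land on the same reducer,
# so the other two reducers receive no input at all.
toy = reducer_usage([5, 305, 605], 3)

# Larger case: 300 hypothetical, nonsequential cluster ids and 300 reducers.
rng = random.Random(0)
cluster_ids = rng.sample(range(3_000_000), 300)
usage = reducer_usage(cluster_ids, 300)

busy = len(usage)                                    # reducers with input
idle = 300 - busy                                    # reducers with none
shared = sum(1 for ids in usage.values() if len(ids) > 1)

print(f"toy case: {toy}")
print(f"reducers with input: {busy}, without input: {idle}, "
      f"holding more than one cluster: {shared}")
```

Under these assumptions, whenever the ids are not a clean run of consecutive numbers, some reducers end up with several clusters and others with none, which would match a mix of populated outputs and empty part-r-* files. Whether clusterpp really keys and partitions its records this way is something to confirm against the Mahout source.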

So if that's correct, is that what's happening to me? Are about half of
my clusters being sent to reducers that also handle other clusters? That
seems like a big issue, making clusterpp pretty much useless for my
purposes. I can't have documents randomly being sent to the wrong
cluster's directory, especially not more than 50% of them.

One final detail: I'm not sure if this matters, but the clusters
output by kmeans are not numbered 1 to 300. They have an odd-looking,
nonsequential numbering sequence. The first 5 clusters are:

I haven't done much with kmeans before, so I'm not sure whether this is
expected behavior.
