mahout-user mailing list archives

From Matt Molek <mpmo...@gmail.com>
Subject Re: clusterpp is only writing directories for about half of my clusters.
Date Sun, 21 Oct 2012 01:01:45 GMT
Thanks for the quick response!

I will do some testing tomorrow with various numbers of clusters and
create a JIRA once I have those results. I might be able to contribute
a patch for this if I have the time.

On Sat, Oct 20, 2012 at 4:24 PM, paritosh ranjan
<paritoshranjan5@gmail.com> wrote:
> "So if that's correct, is that what's happening to me? Half of my
> clusters are being sent to the overlapping reducers? That seems like a
> big issue, making clusterpp pretty much useless for my purposes. I
> can't have documents randomly being sent to the wrong cluster's
> directory, especially not 50+% of them."
>
> This might be correct. I think this can occur if the number of clusters is
> large, and the testing was not done with so many clusters.
> Can you help a bit in testing some scenarios?
>
> a) Try reducing the number of clusters to 100 and then 50. The goal is to
> find the breaking point (number of clusters) after which the clusters start
> converging. If this is found, then we would be sure that the problem lies
> in the partitioner.
> b) If you want, try using a different partitioner. The idea is to create
> as many reducer tasks as the number of (non-empty) clusters found, so
> that the vectors in each cluster are written to a separate file and later
> moved to their respective directories (named after the cluster id).
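
One way to picture suggestion (b) is a lookup table that gives every
distinct cluster id its own partition number. The Python sketch below is
only an illustration of that idea, not Mahout or Hadoop code: the function
names build_partition_index and partition_for are hypothetical, and a real
fix would have to be a Hadoop Partitioner written in Java. The sample ids
are the five quoted later in this thread.

```python
def build_partition_index(cluster_ids):
    """Assign each distinct cluster id its own partition number, 0..n-1."""
    return {cid: i for i, cid in enumerate(sorted(set(cluster_ids)))}


def partition_for(cluster_id, index):
    """Look up the dedicated partition (reducer) for a cluster id."""
    return index[cluster_id]


if __name__ == "__main__":
    # The five ids quoted in the thread; the remaining ids are not shown there.
    ids = [3740844, 3741044, 3741140, 3741161, 3741235]
    index = build_partition_index(ids)
    for cid in ids:
        print(f"VL-{cid} -> reducer {partition_for(cid, index)}")
```

With such a table, the number of reducers equals the number of distinct
cluster ids, and no two clusters can land in the same output file.
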
>
> Please also create a JIRA for this.
> https://issues.apache.org/jira/browse/MAHOUT.
> And if you are interested, this would also be a good starting point for
> contributing to Mahout.
>
> On Sun, Oct 21, 2012 at 1:14 AM, Matt Molek <mpmolek@gmail.com> wrote:
>
>> First off, thank you everyone for your help so far. This mailing list
>> has been a great help getting me up and running with Mahout.
>>
>> Right now, I'm clustering a set of ~3M documents into 300 clusters.
>> Then I'm using clusterpp to split the documents up into directories
>> containing the vectors belonging to each cluster. After I perform the
>> clustering, clusterdump shows that each cluster has between ~800 and
>> ~200,000 documents. This isn't a great spread, but the point is that
>> none of the clusters are empty.
>>
>> Here are my commands:
>>
>> bin/mahout kmeans -i ssvd2/USigma -c initial-centroids -o pca-clusters \
>>   -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 \
>>   -k 300 -x 15 -cl -ow
>>
>> bin/mahout clusterdump -i pca-clusters/clusters-11-final -o clusterdump.txt
>>
>> bin/mahout clusterpp -i pca-clusters -o bottom
>>
>>
>> Since none of my clusters are empty, I would expect clusterpp to
>> create 300 directories in "bottom", one for each cluster. Instead,
>> only 147 directories are created. The other 153 outputs are just empty
>> part-r-* files sitting in the "bottom" directory.
>>
>> I haven't found too much information when searching on this issue but
>> I did come across one mailing list post from a while back:
>>
>> http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/%3C4F3E52FC.7000000@windwardsolutions.com%3E
>>
>> In that discussion someone said, "If that is the only thing that is
>> contained in the part-r-* file [it had no vectors], then the reducer
>> responsible to write to that part-r-* file did not receive any input
>> records to write to it. This happens because the program uses the
>> default hash partitioner which sometimes maps records belonging to
>> different clusters to a same reducer; thus leaving some reducers
>> without any input records."
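
The collision effect described in that quote can be reproduced with a small
simulation. Hadoop's default HashPartitioner picks a reducer as
(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. The Python sketch
below assumes the key's hash code is simply the integer cluster id (as it
would be for an IntWritable key); whether clusterpp actually uses such keys
is an assumption here, and the randomly drawn ids are hypothetical, not the
real ids from this run.

```python
import random

INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE


def hash_partition(key_hash: int, num_reducers: int) -> int:
    """Mimic Hadoop's default HashPartitioner for a given key hash code."""
    return (key_hash & INT_MAX) % num_reducers


def count_used_reducers(cluster_ids, num_reducers: int) -> int:
    """Return how many reducers receive at least one cluster id."""
    return len({hash_partition(cid, num_reducers) for cid in cluster_ids})


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical ids: 300 distinct values drawn from a wide range,
    # standing in for the nonsequential VL-* ids kmeans produces.
    cluster_ids = rng.sample(range(3_000_000, 4_000_000), 300)
    used = count_used_reducers(cluster_ids, 300)
    print(f"{used} of 300 reducers received a cluster; {300 - used} stayed empty")
```

Whenever two ids share the same remainder modulo the reducer count, they
land on one reducer and another reducer is left with an empty part-r-* file.
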
>>
>> So if that's correct, is that what's happening to me? Half of my
>> clusters are being sent to the overlapping reducers? That seems like a
>> big issue, making clusterpp pretty much useless for my purposes. I
>> can't have documents randomly being sent to the wrong cluster's
>> directory, especially not 50+% of them.
>>
>> One final detail: I'm not sure if this matters, but the clusters
>> output by kmeans are not numbered 1 to 300. They have an odd looking,
>> nonsequential numbering sequence. The first 5 clusters are:
>> VL-3740844
>> VL-3741044
>> VL-3741140
>> VL-3741161
>> VL-3741235
>>
>> I haven't done much with kmeans before, so I wasn't sure if this was
>> an unexpected behavior or not.
>>
