mahout-user mailing list archives

From Eric Link <e...@ericmlink.com>
Subject Re: clusterpp is only writing directories for about half of my clusters.
Date Sat, 20 Oct 2012 20:25:35 GMT
We are looking at using Mahout in our organization. We need to do statistical analysis,
clustering, and recommendations. What is the 'sweet spot' for doing this with Mahout?
That is, what types of data sets and data volumes are the best fit for a tool like Mahout,
versus doing things in, say, a SQL database? I hear big data doesn't really start until you
have terabytes or petabytes of data, so I'm not sure my data sets are worthy!
Thanks for any thoughts on the proper fit for a tool like Mahout. - Eric



On Oct 20, 2012, at 2:44 PM, Matt Molek <mpmolek@gmail.com> wrote:

> First off, thank you everyone for your help so far. This mailing list
> has been a great help getting me up and running with Mahout.
> 
> Right now, I'm clustering a set of ~3M documents into 300 clusters.
> Then I'm using clusterpp to split the documents up into directories
> containing the vectors belonging to each cluster. After I perform the
> clustering, clusterdump shows that each cluster has between ~800 and
> ~200,000 documents. This isn't a great spread, but the point is that
> none of the clusters are empty.
> 
> Here are my commands:
> 
> bin/mahout kmeans -i ssvd2/USigma -c initial-centroids -o pca-clusters
> -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05
> -k 300 -x 15 -cl -ow
> 
> bin/mahout clusterdump -i pca-clusters/clusters-11-final -o clusterdump.txt
> 
> bin/mahout clusterpp -i pca-clusters -o bottom
> 
> 
> Since none of my clusters are empty, I would expect clusterpp to
> create 300 directories in "bottom", one for each cluster. Instead,
> only 147 directories are created. The other 153 outputs are just empty
> part-r-* files sitting in the "bottom" directory.
> 
> I haven't found much information when searching on this issue, but
> I did come across one mailing list post from a while back:
> http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/%3C4F3E52FC.7000000@windwardsolutions.com%3E
> 
> In that discussion someone said, "If that is the only thing that is
> contained in the part-r-* file [it had no vectors], then the reducer
> responsible to write to that part-r-* file did not receive any input
> records to write to it. This happens because the program uses the
> default hash partitioner which sometimes maps records belonging to
> different clusters to a same reducer; thus leaving some reducers
> without any input records."
> 
> So if that's correct, is that what's happening to me? Are half of my
> clusters being sent to reducers that other clusters already map to?
> That seems like a big issue, making clusterpp pretty much useless for
> my purposes. I can't have documents randomly being sent to the wrong
> cluster's directory, especially not 50+% of them.
> 
> One final detail: I'm not sure if this matters, but the clusters
> output by kmeans are not numbered 1 to 300. They have an odd looking,
> nonsequential numbering sequence. The first 5 clusters are:
> VL-3740844
> VL-3741044
> VL-3741140
> VL-3741161
> VL-3741235
> 
> I haven't done much with kmeans before, so I wasn't sure whether this
> is expected behavior or not.
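
The hash-partitioning explanation quoted above can be illustrated with a small sketch. It mimics the formula used by Hadoop's default HashPartitioner, (hash & Integer.MAX_VALUE) % numReduceTasks, and applies it to the five cluster IDs listed in the message. Two things here are assumptions, not facts from the thread: that clusterpp keys its records by the integer cluster ID (so the hash is the ID itself), and that the job runs with 300 reducers. The ID 3741144 is hypothetical and was chosen only to show a collision.

```python
from collections import defaultdict


def hash_partition(key_hash: int, num_reducers: int) -> int:
    """Reducer index as Hadoop's default HashPartitioner computes it."""
    return (key_hash & 0x7FFFFFFF) % num_reducers


def assign(cluster_ids, num_reducers):
    """Group cluster IDs by the reducer they would be routed to."""
    buckets = defaultdict(list)
    for cid in cluster_ids:
        buckets[hash_partition(cid, num_reducers)].append(cid)
    return dict(buckets)


# Cluster IDs from the message (the "VL-" prefix dropped), assumed to hash to themselves.
sample_ids = [3740844, 3741044, 3741140, 3741161, 3741235]
# Hypothetical extra ID, exactly 300 above the first one, to force a collision.
hypothetical_id = 3741144

NUM_REDUCERS = 300  # assumption: one reducer per expected cluster

buckets = assign(sample_ids + [hypothetical_id], NUM_REDUCERS)
for reducer, ids in sorted(buckets.items()):
    print(reducer, ids)

# Reducers that received no cluster would write an empty part-r-* file.
empty_reducers = NUM_REDUCERS - len(buckets)
print("reducers with no input:", empty_reducers)
```

With nonsequential IDs, nothing guarantees that 300 clusters land on 300 distinct reducers: every collision leaves one more reducer without input. Whether colliding clusters also end up mixed in a single output directory depends on how the reducer writes its records, which this sketch does not model.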

