mahout-user mailing list archives

From Grant Ingersoll
Subject k-Means questions
Date Thu, 25 Jun 2009 22:49:49 GMT
Do people have recommendations for choosing the starting clusters
(seeds) for k-Means?  The synthetic control example uses Canopy, and I
often see random selection mentioned, but I'm wondering what's
considered best practice for obtaining good overall results.

Also, how best to take the random approach?  On a small data set, I
can easily crank out a program that loops and randomly selects vectors,
but it seems like in an HDFS environment you'd need an M/R job just to
do that initial selection of random documents.  Back in my parallel
computation days (a _long_ time ago) on big old iron, I seem to recall
there being work on parallel/distributed RNG.  Is that useful here, or
is that overkill?  Does Hadoop offer tools for this?
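
For illustration only, here is a minimal sketch of one way the random
selection could be split across mappers without a distributed RNG.  It
is not Mahout or Hadoop code, and the names map_split and reduce_splits
are hypothetical.  The idea: tag every vector with a random key, have
each split keep only its k smallest keys, then keep the k smallest keys
overall.  If the keys are independent and uniform, the result should be
a uniform sample of k vectors.  The sketch assumes each split seeds its
generator differently; it simulates the splits as in-memory lists.

```python
import heapq
import random
from typing import Iterable, List, Sequence, Tuple

Vector = Sequence[float]
Tagged = Tuple[float, Vector]  # (random key, vector)


def map_split(vectors: Iterable[Vector], k: int, rng: random.Random) -> List[Tagged]:
    """Hypothetical "map" step: tag each vector in one split with a random
    key and keep only the k smallest keys seen in that split."""
    tagged = ((rng.random(), tuple(v)) for v in vectors)
    return heapq.nsmallest(k, tagged, key=lambda t: t[0])


def reduce_splits(partials: Iterable[List[Tagged]], k: int) -> List[Vector]:
    """Hypothetical "reduce" step: merge the per-split candidates and keep
    the k smallest keys overall; their vectors become the seeds."""
    merged = (t for part in partials for t in part)
    return [v for _, v in heapq.nsmallest(k, merged, key=lambda t: t[0])]


if __name__ == "__main__":
    # Three simulated input splits of 100 two-dimensional vectors each.
    splits = [
        [(float(i), float(i) + 0.5) for i in range(start, start + 100)]
        for start in (0, 100, 200)
    ]
    # Assumption: each split gets its own seed so the keys are not correlated.
    partials = [map_split(s, 4, random.Random(seed)) for seed, s in enumerate(splits)]
    print(reduce_splits(partials, 4))
```

Each split emits at most k candidates, so the final merge stays small
however large the input is.  Whether this gives good starting clusters
compared with Canopy is a separate question that the sketch does not
address.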

Also, is it just me, or does the KMeansDriver need to take in "k"?  Or
is this just assumed from the number of initial input clusters?

