mahout-user mailing list archives

From: Grant Ingersoll
Subject: Re: Clustering from DB
Date: Thu, 23 Jul 2009 12:54:33 GMT

On Jul 22, 2009, at 10:22 AM, nfantone wrote:

> After setting up the cluster with 6 computers (two quad-core and four
> dual-core, totaling 16 slave cores) and running a KMeansDriver job with
> 32 reduce tasks and ~80 map tasks spawned, it's STILL awfully slow:
>
>   ./bin/hadoop jar ~/mahout-core-0.2.jar \
>     org.apache.mahout.clustering.kmeans.KMeansDriver \
>     -i input/ -c init -o output -r 32 -d 0.001 -k 200
>
> With a pretty small dataset of 62MB, it took more than a whole day to
> complete. The datanode and jobtracker logs don't show any visible
> errors, either. Would you mind sharing any advice that could
> help me tune this for my setup?

That does seem like a long time.

Is your data sparse or dense?

Perhaps a larger convergence delta would help (the -d option, I believe;
you are using 0.001), since it should let the iterations stop sooner.
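
To illustrate why the convergence value matters, here is a minimal sketch of plain k-means in Python, not Mahout's implementation. It assumes the threshold is compared against the largest distance any center moved in an iteration, which is the usual meaning of a convergence delta; the function name `kmeans` and the sample data are made up for the example. On Hadoop each iteration is a separate MapReduce job, so stopping a few iterations earlier can save a lot of wall-clock time.

```python
import math
import random

def kmeans(points, centers, delta, max_iter=100):
    """Plain Lloyd's k-means; stops once no center moves more than `delta`."""
    centers = list(centers)
    for iteration in range(1, max_iter + 1):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = []
        for c, g in zip(centers, groups):
            if g:
                new_centers.append(tuple(sum(x) / len(g) for x in zip(*g)))
            else:
                new_centers.append(c)  # an empty cluster keeps its center
        # Largest movement of any center in this iteration.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= delta:
            break
    return centers, iteration

# Hypothetical data: 500 random 2-D points, 5 clusters, fixed seed.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(500)]
initial = points[:5]

_, tight_iters = kmeans(points, initial, delta=0.001)
_, loose_iters = kmeans(points, initial, delta=0.05)
print("iterations with delta=0.001:", tight_iters)
print("iterations with delta=0.05: ", loose_iters)
```

With the same starting centers, the looser threshold can never need more iterations than the tighter one, at the cost of slightly less settled centers.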

Is there any chance your data is publicly shareable?  Come to think of
it, with the vector representations, as long as you don't publish the
key (the mapping from terms to vector indices), I would think almost
all data is publicly shareable.
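
The idea of sharing vectors while withholding the key can be sketched as follows. This is an illustrative Python example, not Mahout's vectorization code; the function `vectorize` and the two sample documents are hypothetical. Each document becomes a sparse vector of `{index: count}` entries, and the term-to-index dictionary is kept separately.

```python
from collections import Counter

def vectorize(docs):
    """Return (dictionary, vectors).

    dictionary maps term -> integer index (the "key" to keep private);
    vectors holds one sparse {index: count} mapping per document.
    """
    dictionary = {}
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        vec = {}
        for term, n in counts.items():
            # Assign the next free index the first time a term is seen.
            index = dictionary.setdefault(term, len(dictionary))
            vec[index] = n
        vectors.append(vec)
    return dictionary, vectors

# Hypothetical sample documents.
docs = ["hadoop runs map tasks", "mahout runs on hadoop"]
dictionary, vectors = vectorize(docs)

# Only `vectors` would be published; `dictionary` stays private.
for vec in vectors:
    print(sorted(vec.items()))
```

The published vectors contain only integer indices and counts, which is also the sparse representation the question about sparse versus dense data refers to: only the non-zero entries are stored.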

Are you on Mahout trunk?  I think we still need more profiling to
get a better idea of where improvements can be made.

