mahout-user mailing list archives

From Pat Ferrel <>
Subject Clustering a large crawl
Date Wed, 30 May 2012 17:23:30 GMT
I have about 150,000 docs on which I ran canopy, using the Cosine 
distance measure, with t1 = t2 set to values from 0.1 to 0.95. The 
results ranged from 1.5 to 3 docs per cluster. In other words, canopy 
produced a very large number of centroids, which does not seem to 
represent the data very well. Trying random values for k seems to 
produce better results, but they are still spotty and hard to judge. I 
am at the point of giving up on canopy, so I wrote a utility that simply 
iterates k over some values and runs the evaluators each time, but there 
are currently some problems with CDbw (Inter-Cluster Density is always 
0.0, for instance).
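
For readers unfamiliar with the t1/t2 thresholds mentioned above, here is a minimal sketch of the textbook canopy procedure with cosine distance. It is plain Python, not Mahout's implementation, and the names `cosine_distance`, `canopy`, and the toy `docs` vectors are made up for illustration. It only shows how the number of canopies grows as t1 = t2 shrinks.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat a zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def canopy(points, t1, t2, distance=cosine_distance):
    """Textbook canopy clustering; returns a list of (center, members) index pairs.

    t1 is the loose threshold (canopy membership), t2 the tight threshold
    (points this close to a center can no longer start their own canopy).
    """
    if t1 < t2:
        raise ValueError("t1 must be >= t2")
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]
        members = [center]
        still_remaining = []
        for i in remaining[1:]:
            d = distance(points[center], points[i])
            if d <= t1:
                members.append(i)
            if d > t2:
                still_remaining.append(i)
        canopies.append((center, members))
        remaining = still_remaining
    return canopies


# Toy vectors standing in for document term vectors (hypothetical data).
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

for t in (0.5, 0.001):
    result = canopy(docs, t1=t, t2=t)
    print(t, len(result), len(docs) / len(result))
```

With t1 = t2, every point either joins the current canopy and is removed, or is left to seed a later one, so the threshold alone decides how many centroids come out. If most document pairs sit farther apart than the chosen threshold, nearly every doc becomes its own canopy.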

This seems like such a fundamental problem that others must have found a 
way to get better results. Any suggestions?
