mahout-user mailing list archives

From Robert Stewart
Subject Re: Clustering a large crawl
Date Wed, 30 May 2012 17:39:15 GMT
What type of documents are these (news, web pages, PDFs, etc.)?  How are your vectors
constructed?  Are you using TF*IDF on just unigrams, or on n-grams as well?  Do you remove
stop words or stem the content?  How do you know the data contains the type of clusters you
expect Mahout to find?

Some suggestions:
- Make sure you remove stop words and use stemming on the remaining terms
- Try using bi-grams instead of single terms
- Try building clusters from document headlines only (in the case of news articles)
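
As a rough illustration of the first two suggestions, here is a minimal sketch in plain Python (not Mahout code). Everything in it is an assumption made for the example: the tiny stop list, the crude_stem helper (a stand-in for a real stemmer such as Porter), and the function names. It only shows how stop-word removal, stemming, and bi-grams change the terms that go into TF*IDF vectors and the cosine distance.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; real stop lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "to", "is", "are"}


def crude_stem(token):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def bigram_terms(text):
    """Lowercase, drop stop words, stem, then join adjacent tokens into bi-grams."""
    tokens = [
        crude_stem(t)
        for t in re.findall(r"[a-z]+", text.lower())
        if t not in STOP_WORDS
    ]
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def tfidf_vectors(docs):
    """Build sparse TF*IDF vectors (dicts) over bi-gram terms."""
    term_lists = [bigram_terms(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for terms in term_lists for term in set(terms))
    vectors = []
    for terms in term_lists:
        tf = Counter(terms)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


docs = ["stock markets fall", "stock markets rise", "football team wins"]
vecs = tfidf_vectors(docs)
print(cosine_distance(vecs[0], vecs[1]))  # the two finance headlines
print(cosine_distance(vecs[0], vecs[2]))  # finance vs. sports headline
```

With short inputs such as headlines, most pairs share no terms at all and sit at distance 1.0, which is worth keeping in mind when picking distance thresholds.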

On May 30, 2012, at 1:23 PM, Pat Ferrel wrote:

> I have about 150,000 docs on which I ran canopy with values for t1 = t2 from 0.1 to 0.95
> using the Cosine distance measure. I got results that range from 1.5 docs per cluster to 3.
> In other words canopy produced a very large number of centroids, which does not seem to represent
> the data very well. Trying random values for k seems to produce better results but still spotty
> and hard to judge. I am at the point of giving up on canopy and so wrote a utility to simply
> iterate k over some values and run the evaluators each time, but there are currently some
> problems with CDbw (Inter-Cluster Density is always 0.0 for instance).
> This seems like such a fundamental problem that others must have found a way to get better
> results. Any suggestions?
