mahout-user mailing list archives

From Shashikant Kore <>
Subject Re: Cluster text docs
Date Fri, 18 Dec 2009 06:36:18 GMT
I have done it.

Mahout doesn't have code to generate a Lucene index; we assume you have
already created one. From a Lucene index you can easily create vectors,
run k-means, and use ClusterDumper to get the clusters, the documents in
each cluster, and the top features from the centroid vector.
(Drew, cluster assignment is already there. Wonder why you had to redo it.)
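For anyone new to the flow, here is a minimal sketch of that pipeline (vectorize, k-means, dump top centroid features) in plain Python. This is illustrative only, not Mahout's API; the toy documents, the term-frequency vectorizer, and the function names are all assumptions made for the example.

```python
# Toy version of: Lucene index -> vectors -> k-means -> top centroid features.
# Plain Python, NOT Mahout code; everything here is illustrative.
import random
from collections import Counter

docs = [
    "apache mahout runs kmeans on hadoop",
    "kmeans clustering of text documents",
    "lucene builds an inverted index",
    "index text with lucene for search",
]

# Vectorize: simple term-frequency vectors, standing in for vectors
# built from a Lucene index (frequency filtering omitted for brevity).
vocab = sorted({t for d in docs for t in d.split()})

def vectorize(doc):
    tf = Counter(doc.split())
    return [float(tf[t]) for t in vocab]

vectors = [vectorize(d) for d in docs]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=10, seed=0):
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k), key=lambda c: dist2(v, centroids[c]))
            clusters[i].append(v)
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, clusters

centroids, clusters = kmeans(vectors, k=2)

# ClusterDumper-style output: top features from each centroid vector.
for c in centroids:
    top = sorted(zip(vocab, c), key=lambda p: -p[1])[:3]
    print([t for t, _ in top])
```

The real pipeline does the same three steps at scale: vector generation from the index, the k-means driver, and ClusterDumper reading the final centroids.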

I have run k-means with a quarter million documents, each with about 200
features on average. I don't recall the total number of features in the
corpus, but with the optimizations to distance calculations I suspect it
doesn't affect performance. Also, during vector generation, terms that
are too frequent or too rare are ignored. I am able to run clustering on
this set (100 random centroids, 10 iterations) in under 30 minutes on a
single host.
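The distance optimization mentioned above is presumably the usual squared-norm expansion, ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, which lets the per-document cost depend on the ~200 nonzero terms rather than the full corpus vocabulary. A small Python sketch (illustrative, not Mahout code; the sparse-dict representation and function names are assumptions):

```python
# Squared Euclidean distance between a sparse doc vector and a dense
# centroid, two ways. The "fast" form only touches the doc's nonzeros,
# so vocabulary size stops mattering per document.

def naive_dist2(x, c_dense, dim):
    # x is a sparse dict {index: value}; this touches every dimension.
    return sum((x.get(i, 0.0) - c_dense[i]) ** 2 for i in range(dim))

def fast_dist2(x, c_dense, c_norm2):
    # ||x||^2 - 2*x.c + ||c||^2; iterates only over x's nonzeros.
    # ||c||^2 is precomputed once per centroid per iteration.
    x_norm2 = sum(v * v for v in x.values())
    dot = sum(v * c_dense[i] for i, v in x.items())
    return x_norm2 - 2.0 * dot + c_norm2

dim = 1000
x = {3: 2.0, 17: 1.0, 512: 4.0}   # a doc has ~200 nonzeros in practice
c = [0.0] * dim
c[3], c[999] = 1.0, 2.0
c_norm2 = sum(v * v for v in c)

assert abs(naive_dist2(x, c, dim) - fast_dist2(x, c, c_norm2)) < 1e-9
```

With 100 centroids this saves a full pass over the vocabulary per document per centroid, which is consistent with the whole run finishing in under 30 minutes on one host.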

Drew, are you using the latest code? Overnight sounds too long.


On Fri, Dec 18, 2009 at 5:00 AM, Benson Margulies <> wrote:
> Gang,
> What's the state of the world on clustering a raft of textual
> documents? Are all the pieces in place to start from a directory of
> flat text files, push through Lucene to get the vectors, keep labels
> on the vectors to point back to the files, and run, say, k-means?
> I've got enough data here that skimming off the top few unigrams might
> also be advisable.
> I tried running this through Weka, and blew it out of virtual memory.
> --benson
