mahout-user mailing list archives

From Shashikant Kore <shashik...@gmail.com>
Subject Re: Cluster text docs
Date Fri, 18 Dec 2009 06:36:18 GMT
I have done it.

Mahout doesn't have code to generate a Lucene index; we assume you
have already created one. From the Lucene index you can easily create
vectors, run k-means, and use ClusterDumper to get the clusters, the
documents in each cluster, and the top features from each centroid
vector. (Drew, cluster assignment is already there. Wonder why you had
to redo it.)
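
For concreteness, here is roughly what the index-to-vectors step boils
down to. This is a minimal sketch against the Lucene 2.x reader API,
not Mahout's actual vector-generation code; the field name and the
document-frequency thresholds are placeholders you would tune:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class LuceneVectors {

  // Builds one sparse term-frequency vector per document, keyed by doc id.
  public static Map<Integer, Map<Integer, Double>> vectorize(
      IndexReader reader, String field, int minDf, double maxDfFraction)
      throws IOException {
    Map<Integer, Map<Integer, Double>> vectors =
        new HashMap<Integer, Map<Integer, Double>>();
    int numDocs = reader.numDocs();
    int termId = 0;
    // terms(Term) positions the enumerator at the first term >= the probe.
    TermEnum terms = reader.terms(new Term(field, ""));
    do {
      Term term = terms.term();
      if (term == null || !term.field().equals(field)) {
        break;
      }
      int df = reader.docFreq(term);
      // Drop terms that are too rare or too frequent, as described below.
      if (df >= minDf && df <= maxDfFraction * numDocs) {
        TermDocs docs = reader.termDocs(term);
        while (docs.next()) {
          Map<Integer, Double> vec = vectors.get(docs.doc());
          if (vec == null) {
            vec = new HashMap<Integer, Double>();
            vectors.put(docs.doc(), vec);
          }
          // Raw term frequency as the weight; TF-IDF works the same way.
          vec.put(termId, (double) docs.freq());
        }
      }
      termId++;
    } while (terms.next());
    return vectors;
  }
}

The resulting sparse vectors (plus a termId-to-term dictionary, if you
keep one) are what k-means consumes and what ClusterDumper reads the
top centroid features back out of.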

I have run k-means with a quarter million documents, each with 200
features on average. I don't recall the total number of features in
the corpus, but with the optimizations to the distance calculations, I
suspect it doesn't affect performance. Also, during vector generation,
terms that are too frequent or too rare are ignored. I am able to run
clustering on this set (100 random centroids, 10 iterations) in less
than 30 minutes on a single host.
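
The distance optimization is the usual algebraic one: since
||x - c||^2 = ||x||^2 + ||c||^2 - 2(x . c), you can cache each
centroid's squared norm once per iteration and then only touch a
document's non-zero entries (about 200 here) per distance, instead of
walking the full corpus dimensionality. A sketch in plain Java over
sparse maps, not the actual Mahout code:

import java.util.Map;

public class SparseDistance {

  // Squared L2 norm; compute once per centroid per iteration and cache it.
  public static double normSquared(Map<Integer, Double> v) {
    double sum = 0.0;
    for (double x : v.values()) {
      sum += x * x;
    }
    return sum;
  }

  // Squared Euclidean distance via ||x||^2 + ||c||^2 - 2(x . c),
  // iterating only over the document's non-zero entries.
  public static double distanceSquared(Map<Integer, Double> doc,
                                       double docNormSq,
                                       Map<Integer, Double> centroid,
                                       double centroidNormSq) {
    double dot = 0.0;
    for (Map.Entry<Integer, Double> e : doc.entrySet()) {
      Double c = centroid.get(e.getKey());
      if (c != null) {
        dot += e.getValue() * c;
      }
    }
    return docNormSq + centroidNormSq - 2.0 * dot;
  }
}

With 100 centroids cached this way, each distance costs on the order
of the document's non-zero count, which is why the total number of
features in the corpus barely matters.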

Drew, are you using the latest code? Overnight sounds too long.

--shashi

On Fri, Dec 18, 2009 at 5:00 AM, Benson Margulies <bimargulies@gmail.com> wrote:
> Gang,
>
> What's the state of the world on clustering a raft of textual
> documents? Are all the pieces in place to start from a directory of
> flat text files, push through Lucene to get the vectors, keep labels
> on the vectors to point back to the files, and run, say, k-means?
>
> I've got enough data here that skimming off the top few unigrams might
> also be advisable.
>
> I tried running this through Weka, and blew it out of virtual memory.
>
> --benson
>
