mahout-user mailing list archives

From Benson Margulies
Subject Cluster text docs
Date Thu, 17 Dec 2009 23:30:37 GMT

What's the state of the world on clustering a raft of textual
documents? Are all the pieces in place to start from a directory of
flat text files, push through Lucene to get the vectors, keep labels
on the vectors to point back to the files, and run, say, k-means?

I've got enough data here that skimming off the top few unigrams might
also be advisable.
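
For illustration only, the pipeline asked about above (directory of text files, labelled term vectors, optional removal of the most frequent unigrams, k-means) can be sketched in plain Python. This is not Mahout or Lucene code: the function names load_docs, vectorize, and kmeans are hypothetical, the whitespace tokenizer and the ".txt" glob are assumptions, and reading "skimming off the top few unigrams" as dropping the most frequent terms is one interpretation. Everything is held in memory, so it only shows the shape of the steps, not a solution for a large corpus.

```python
import math
import random
from collections import Counter
from pathlib import Path


def load_docs(directory):
    """Map each file name (used as the label) to its lower-cased whitespace tokens."""
    return {
        path.name: path.read_text(encoding="utf-8", errors="ignore").lower().split()
        for path in sorted(Path(directory).glob("*.txt"))
    }


def vectorize(docs, skip_top=0):
    """Build sparse term-frequency vectors, dropping the skip_top most frequent unigrams."""
    totals = Counter(tok for toks in docs.values() for tok in toks)
    dropped = {term for term, _ in totals.most_common(skip_top)}
    return {
        label: Counter(tok for tok in toks if tok not in dropped)
        for label, toks in docs.items()
    }


def _distance(vec, centroid):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(vec) | set(centroid)
    return math.sqrt(sum((vec.get(k, 0.0) - centroid.get(k, 0.0)) ** 2 for k in keys))


def _mean(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means over labelled vectors; returns {label: cluster index}."""
    labels = sorted(vectors)
    rng = random.Random(seed)
    # Assumption: initial centroids are k randomly chosen documents.
    centroids = [dict(vectors[label]) for label in rng.sample(labels, k)]
    assignment = {}
    for _ in range(iterations):
        new_assignment = {
            label: min(range(k), key=lambda c: _distance(vectors[label], centroids[c]))
            for label in labels
        }
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [vectors[label] for label in labels if assignment[label] == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = _mean(members)
    return assignment
```

Because the vectors are keyed by file name throughout, each cluster assignment points straight back to its source file, e.g. kmeans(vectorize(load_docs("docs"), skip_top=5), k=10), where "docs", 5, and 10 are placeholder values.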

I tried running this through Weka, and blew it out of virtual memory.

