mahout-user mailing list archives

From Benson Margulies <>
Subject Re: Cluster text docs
Date Fri, 18 Dec 2009 03:36:28 GMT
I have a large pile of Hebrew news articles. I want to cluster them so
that I can select a disparate subset for initial tagging of a named
entity extraction model.

On Thu, Dec 17, 2009 at 10:34 PM, Drew Farris <> wrote:
> Hi Benson,
> I've managed to go from a lucene index to k-means output with a couple
> smaller corpora. One around 500k items, about 1M total/100k unique
> tokens and another with about half that number of items but with about
> 3M total/300k unique tokens (unigrams in some cases and a mixture of
> unigrams and a limited set of bigrams in another). I ended up doing a
> number of runs with various settings, but somewhat arbitrarily I ended
> up filtering out terms that appeared in fewer than 8 items. I started
> with 1000 random centroids and ran 10 iterations. These runs completed
> overnight on the minuscule 2-machine cluster I use for testing; they
> probably would have run without a problem without a cluster at all. I
> never did go back and check whether they had converged before running
> all 10 iterations.
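The run described above (initial centroids, an iteration cap, and an optional early stop on convergence) can be sketched in plain Java. This is a hypothetical, self-contained illustration, not Mahout's implementation; initial centroids are chosen by caller-supplied indices rather than at random so the example is deterministic:

```java
import java.util.Arrays;

// Minimal k-means sketch (illustrative only, not Mahout's implementation).
public class KMeansSketch {

    // Squared Euclidean distance; sufficient for nearest-centroid comparisons.
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++)
            if (dist(p, centroids[c]) < dist(p, centroids[best])) best = c;
        return best;
    }

    // Returns the cluster assignment of each point. initIdx selects the
    // initial centroids (a real run would pick k random centroids).
    static int[] cluster(double[][] points, int[] initIdx, int maxIter) {
        int k = initIdx.length, dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = points[initIdx[c]].clone();
        int[] assign = new int[points.length];
        Arrays.fill(assign, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int c = nearest(points[i], centroids);
                if (c != assign[i]) { assign[i] = c; changed = true; }
            }
            if (!changed) break; // converged before hitting the iteration cap
            // Recompute each centroid as the mean of its assigned points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assign[i]]++;
                for (int d = 0; d < dim; d++) sums[assign[i]][d] += points[i][d];
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int d = 0; d < dim; d++)
                        centroids[c][d] = sums[c][d] / counts[c];
        }
        return assign;
    }
}
```

The `changed` flag is the cheap convergence check Drew mentions not having done: once no point switches clusters, further iterations cannot change the result.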
> In each case I had the tools to inject item labels and tokens into a
> lucene index already, so I did not have to use any Mahout-provided
> tools to set that up. It would be nice to provide a tool that did
> this, but what general-purpose tokenization pipeline should be used?
> In my case I was using a processor based on something developed
> internally for another project.
> Nevertheless, the lucene index had a stored field for document labels
> and a tokenized, indexed field with term vectors stored, from which
> the tokens were extracted. Using o.a.m.utils.vectors.lucene.Driver, I
> was able to produce vectors suitable as a starting point for k-means.
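Conceptually, that vectorization step pairs a dictionary (term → dimension index) with one sparse term-frequency vector per document, after dropping low-document-frequency terms. A toy, self-contained sketch of that shape, assuming a pre-tokenized corpus (Mahout's and Lucene's real classes differ):

```java
import java.util.*;

// Hypothetical sketch of the dictionary + sparse-vector output shape,
// not the actual o.a.m.utils.vectors.lucene.Driver implementation.
public class VectorizeSketch {

    // Build term -> dimension index for terms whose document frequency
    // is at least minDf (Drew's runs used a cutoff of 8 documents).
    static Map<String, Integer> buildDictionary(List<List<String>> docs, int minDf) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs)
            for (String t : new HashSet<>(doc)) // count each doc once per term
                df.merge(t, 1, Integer::sum);
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet())
            if (e.getValue() >= minDf) dict.put(e.getKey(), dict.size());
        return dict;
    }

    // Sparse TF vector: dimension index -> term count, dictionary terms only.
    static Map<Integer, Integer> vectorize(List<String> doc, Map<String, Integer> dict) {
        Map<Integer, Integer> v = new HashMap<>();
        for (String t : doc) {
            Integer idx = dict.get(t);
            if (idx != null) v.merge(idx, 1, Integer::sum);
        }
        return v;
    }
}
```

Sparse maps matter here: with 100k–300k unique tokens, dense vectors of that width per document would be prohibitively large.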
> After running, k-means emits cluster and point data. Everything can be
> dumped using o.a.m.utils.clustering.ClusterDumper, which takes the
> clustering output and the dictionary file produced by the
> lucene.Driver and produces a text file containing what I believe to be
> a Gson(?) representation of the SparseVector representing the centroid
> of the cluster (I need to verify this), the top terms found in the
> cluster, and the labels of the items that fell into that cluster.
> I've opened up the ClusterDumper code and produced something that
> emits documents and their cluster assignments to support the
> investigation I'm doing.
> I have not done an exhaustive amount of validation on the output, but
> based on what I have done, the results look very promising.
> I've tried to run LDA on the same corpora, but haven't met with any
> success. I'm under the impression that I'm either doing something
> horribly wrong, or the scaling characteristics of the algorithm are
> quite different from k-means. I haven't managed to get my head around
> the algorithm or read the code enough to figure out what the problem
> could be at this point.
> What are the characteristics of the collection of documents you are
> attempting to cluster?
> Drew
> On Thu, Dec 17, 2009 at 6:30 PM, Benson Margulies <> wrote:
>> Gang,
>> What's the state of the world on clustering a raft of textual
>> documents? Are all the pieces in place to start from a directory of
>> flat text files, push through Lucene to get the vectors, keep labels
>> on the vectors to point back to the files, and run, say, k-means?
>> I've got enough data here that skimming off the top few unigrams might
>> also be advisable.
>> I tried running this through Weka, and blew it out of virtual memory.
>> --benson
