mahout-user mailing list archives

From "Philippe Lamarche" <philippe.lamar...@gmail.com>
Subject Text clustering
Date Wed, 03 Dec 2008 16:48:16 GMT
Hi,

I have a question concerning text clustering and the current
K-Means/vector implementation.

For a school project, I did some text clustering with a subset of the Enron
corpus. I implemented a small M/R package that transforms text into TF-IDF
vector space, and then I used a slightly modified version of the
syntheticcontrol K-Means example. So far, all is fine.
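To be clear about what my M/R package computes, here is a rough sketch of the
weighting (plain Java with placeholder names, not my actual mapper/reducer code
and not Mahout classes):

    import java.util.HashMap;
    import java.util.Map;

    public class TfIdfSketch {

        /**
         * termFreqs: term -> raw count in one document
         * docFreqs:  term -> number of documents containing the term
         * numDocs:   total number of documents in the corpus
         * Returns term -> tf-idf weight for that document.
         */
        public static Map<String, Double> tfIdf(Map<String, Integer> termFreqs,
                                                Map<String, Integer> docFreqs,
                                                int numDocs) {
            Map<String, Double> weights = new HashMap<String, Double>();
            for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
                String term = e.getKey();
                double tf = e.getValue();
                double df = docFreqs.get(term);
                // classic tf * log(N / df) weighting
                weights.put(term, tf * Math.log(numDocs / df));
            }
            return weights;
        }
    }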

However, the output of the k-means algorithm is vectors, just like the input. As I
understand it, when text is transformed into vector space, the cardinality of
each vector is the size of the global dictionary, i.e., all words across all the
texts being clustered. This can grow pretty quickly. For example, with only
27000 Enron emails, even after removing words that appear in 2 emails or
fewer, the dictionary size is about 45000 words.
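The pruning I mean is just document-frequency filtering, something along these
lines (again a sketch with placeholder names, not my real job):

    import java.util.HashMap;
    import java.util.Map;

    public class DictionarySketch {

        /** docFreqs: term -> number of emails containing the term. */
        public static Map<String, Integer> buildDictionary(Map<String, Integer> docFreqs,
                                                           int minDocFreq) {
            Map<String, Integer> termToIndex = new HashMap<String, Integer>();
            int index = 0;
            for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
                if (e.getValue() > minDocFreq) {          // e.g. minDocFreq = 2
                    termToIndex.put(e.getKey(), index++); // term becomes one vector dimension
                }
            }
            // the size of this map is the vector cardinality
            // (~45000 for my 27000-email subset)
            return termToIndex;
        }
    }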

My number one problem is this: how can we find out which document a vector
represents when it comes out of the k-means algorithm? My favorite
solution would be to have a unique id attached to each vector. Is there such
an ID in the vector implementation? Is there a better solution? Is my approach
to text clustering wrong?
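
To make the question concrete, this is roughly the shape of what I am hoping
for: a vector that carries its document id through the clustering. This is just
a placeholder class to illustrate the idea, not something I have found in
Mahout:

    public class LabeledVector {

        private final String docId;     // e.g. the Enron message file name
        private final double[] values;  // tf-idf weights, one slot per dictionary term

        public LabeledVector(String docId, double[] values) {
            this.docId = docId;
            this.values = values;
        }

        public String getDocId() {
            return docId;
        }

        public double[] getValues() {
            return values;
        }
    }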

Thanks for the help,

Philippe.
