Hi,
I have a question about text clustering and the current
KMeans/vector implementation.
For a school project, I did some text clustering on a subset of the Enron
corpus. I implemented a small M/R package that transforms text into TF-IDF
vector space, and then I used a slightly modified version of the
syntheticcontrol KMeans example. So far, all is fine.
However, the output of the KMeans algorithm is a vector, as is the input. As I
understand it, when text is transformed into vector space, the cardinality of
each vector is the size of the global dictionary, i.e. all distinct words across
all the texts being clustered. This can grow pretty quickly. For example, with
only 27,000 Enron emails, even after removing words that appear in 2 emails or
fewer, the dictionary size is about 45,000 words.
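To make the cardinality issue concrete, here is a rough in-memory sketch of the transform my M/R package performs (simplified, single-machine version; not the actual job code). The class and method names are my own:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Simplified sketch of TF-IDF vectorization: one vector dimension per
// distinct word in the global dictionary, so vector cardinality grows
// with the corpus vocabulary.
public class TfIdfSketch {

    // Build the global dictionary from all documents.
    public static List<String> buildDictionary(List<String[]> docs) {
        SortedSet<String> vocab = new TreeSet<>();
        for (String[] doc : docs) {
            vocab.addAll(Arrays.asList(doc));
        }
        return new ArrayList<>(vocab);
    }

    // TF-IDF weight for each dictionary term in one document.
    public static double[] vectorize(String[] doc, List<String> dict,
                                     List<String[]> docs) {
        double[] v = new double[dict.size()];
        for (int i = 0; i < dict.size(); i++) {
            String term = dict.get(i);
            long tf = Arrays.stream(doc).filter(term::equals).count();
            long df = docs.stream()
                          .filter(d -> Arrays.asList(d).contains(term))
                          .count();
            if (tf > 0) {
                // Raw term frequency times log inverse document frequency.
                v[i] = tf * Math.log((double) docs.size() / df);
            }
        }
        return v;
    }
}
```

With 27,000 emails the `dict` list ends up around 45,000 entries, hence 45,000-dimensional vectors.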
My number one problem is this: how can we find out which document a vector
represents when it comes out of the KMeans algorithm? My preferred solution
would be to attach a unique ID to each vector. Is there such an ID in the
vector implementation? Is there a better solution? Is my approach to text
clustering wrong?
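To illustrate what I mean by an ID attached to each vector, here is the kind of wrapper I have in mind. This is a hypothetical class of my own, not an existing Mahout API (that is exactly what I am asking about):

```java
// Hypothetical sketch: a vector that carries its document ID through
// the clustering pipeline, so cluster output can be traced back to the
// original email. Not an existing Mahout class.
public class NamedVectorSketch {
    private final String docId;    // e.g. the email's file name
    private final double[] values; // the TF-IDF weights

    public NamedVectorSketch(String docId, double[] values) {
        this.docId = docId;
        this.values = values;
    }

    public String getDocId() { return docId; }

    public double[] getValues() { return values; }
}
```

If the clustering code ignored the ID and only read the values, the ID would survive the round trip and identify the source document on output.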
Thanks for the help,
Philippe.
