mahout-user mailing list archives

From "Richard Tomsett" <indigentmart...@gmail.com>
Subject Re: Text clustering
Date Fri, 05 Dec 2008 11:05:29 GMT
Sure :-) I haven't got my project on me at the moment, but I should be able
to get at it some time before Xmas, so I'll look through it again and send
you anything that may be useful.


2008/12/5 Grant Ingersoll <gsingers@apache.org>

> I seem to recall some discussion a while back about being able to add
> labels to the vectors/matrices, but I don't know the status of the patch.
>
> At any rate, very cool that you are using it for text clustering.  I still
> have on my list to write up how to do this and to write some supporting code
> as well.  So, if either of you cares to contribute, that would be most
> useful.
>
> -Grant
>
>
> On Dec 3, 2008, at 6:46 PM, Richard Tomsett wrote:
>
>> Hi Philippe,
>>
>> I used K-Means on TF-IDF vectors and wondered the same thing about
>> labelling the documents. I haven't got my code on me at the moment, and
>> it was a few months ago that I last looked at it (so I was probably using
>> an older version of Mahout)... but I seem to remember doing just as you
>> suggest: I simply attached a unique ID to each document, which got passed
>> through the map-reduce stages. This requires a bit of tinkering with the
>> K-Means implementation, but it shouldn't be too much work.
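>>
>> It went something along these lines, I think. This is from memory, so
>> the class is invented rather than real Mahout code, but it shows the
>> idea: carry the ID through Hadoop's serialisation alongside the vector.
>>
>> import java.io.DataInput;
>> import java.io.DataOutput;
>> import java.io.IOException;
>> import org.apache.hadoop.io.Writable;
>>
>> // Sketch only -- an invented class, not actual Mahout API: a vector
>> // paired with a document ID, so the ID passes through the K-Means
>> // map-reduce stages unchanged.
>> public class LabelledVector implements Writable {
>>   private String docId;     // unique document identifier
>>   private double[] values;  // the TF-IDF weights
>>
>>   public LabelledVector() {}  // no-arg constructor needed by Hadoop
>>
>>   public LabelledVector(String docId, double[] values) {
>>     this.docId = docId;
>>     this.values = values;
>>   }
>>
>>   public String getDocId() { return docId; }
>>
>>   public void write(DataOutput out) throws IOException {
>>     out.writeUTF(docId);
>>     out.writeInt(values.length);
>>     for (double v : values) {
>>       out.writeDouble(v);
>>     }
>>   }
>>
>>   public void readFields(DataInput in) throws IOException {
>>     docId = in.readUTF();
>>     values = new double[in.readInt()];
>>     for (int i = 0; i < values.length; i++) {
>>       values[i] = in.readDouble();
>>     }
>>   }
>> }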
>>
>> As for the massive vectors: you could try representing them as sparse
>> vectors rather than the dense vectors the standard Mahout K-Means
>> algorithm accepts, which gets rid of all the zero values in the document
>> vectors. See the Javadoc for details; it'll be more reliable than my
>> memory :-)
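>>
>> The gist is just a map of index -> value for the non-zero entries.
>> This is an illustration only (the real sparse class is in the Javadoc),
>> but conceptually:
>>
>> import java.util.HashMap;
>> import java.util.Map;
>>
>> // Stores just the non-zero TF-IDF weights, keyed by term index,
>> // instead of a ~45000-element array that is mostly zeros.
>> public class SparseDocVector {
>>   private final Map<Integer, Double> weights =
>>       new HashMap<Integer, Double>();
>>
>>   public void set(int termIndex, double weight) {
>>     if (weight != 0.0) {
>>       weights.put(termIndex, weight);
>>     }
>>   }
>>
>>   public double get(int termIndex) {
>>     Double w = weights.get(termIndex);
>>     return w == null ? 0.0 : w;  // absent entries are implicit zeros
>>   }
>> }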
>>
>> Richard
>>
>>
>> 2008/12/3 Philippe Lamarche <philippe.lamarche@gmail.com>
>>
>>> Hi,
>>>
>>> I have a question concerning text clustering and the current
>>> K-Means/vectors implementation.
>>>
>>> For a school project, I did some text clustering with a subset of the
>>> Enron corpus. I implemented a small M/R package that transforms text
>>> into the TF-IDF vector space, and then I used a slightly modified
>>> version of the syntheticcontrol K-Means example. So far, all is fine.
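>>>
>>> The weighting itself is the standard TF-IDF formula. Here is a
>>> simplified, single-machine sketch of roughly what the jobs compute:
>>>
>>> // weight(term, doc) = tf(term, doc) * log(N / df(term)), where N is
>>> // the number of documents and df(term) the number containing term.
>>> double tfIdf(int termFreq, int docFreq, int numDocs) {
>>>   return termFreq * Math.log((double) numDocs / docFreq);
>>> }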
>>>
>>> However, the output of the k-means algorithm is vectors, as is the
>>> input. As I understand it, when text is transformed into vector space,
>>> the cardinality of each vector is the number of words in your global
>>> dictionary, i.e. all the words across all the texts being clustered.
>>> This can grow pretty quickly. For example, with only 27000 Enron emails,
>>> even after removing words that appear in 2 emails or fewer, the
>>> dictionary size is about 45000 words.
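>>>
>>> The pruning step is just a document-frequency cutoff, roughly like
>>> this (a simplified sketch, not the actual M/R code):
>>>
>>> import java.util.HashMap;
>>> import java.util.Map;
>>>
>>> // Given each word's document frequency, keep only words that appear
>>> // in 3 or more emails and give each survivor a compact index into
>>> // the vector space.
>>> Map<String, Integer> buildDictionary(Map<String, Integer> docFreqs) {
>>>   Map<String, Integer> dictionary = new HashMap<String, Integer>();
>>>   int nextIndex = 0;
>>>   for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
>>>     if (e.getValue() > 2) {
>>>       dictionary.put(e.getKey(), nextIndex++);
>>>     }
>>>   }
>>>   return dictionary;
>>> }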
>>>
>>> My number one problem is this: how can we find out which document a
>>> vector represents when it comes out of the k-means algorithm? My
>>> favorite solution would be to have a unique ID attached to each vector.
>>> Is there such an ID in the vector implementation? Is there a better
>>> solution? Is my approach to text clustering wrong?
>>>
>>> Thanks for the help,
>>>
>>> Philippe.
>>>
>>>
> --------------------------
> Grant Ingersoll
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
