mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Text clustering
Date Fri, 05 Dec 2008 13:40:25 GMT

On Dec 5, 2008, at 6:05 AM, Richard Tomsett wrote:

> Sure :-) I haven't got my project on me at the moment but should be
> able to get at it some time before Xmas so will look through it again
> and send you anything that may be useful.

Cool, just add a patch to JIRA, if you can.  I think we could work  
together to create a Text Clustering "example".


>
> 2008/12/5 Grant Ingersoll <gsingers@apache.org>
>
>> I seem to recall some discussion a while back about being able to add
>> labels to the vectors/matrices, but I don't know the status of the
>> patch.
>>
>> At any rate, very cool that you are using it for text clustering. I
>> still have on my list to write up how to do this and to write some
>> supporting code as well. So, if either of you cares to contribute, that
>> would be most useful.
>>
>> -Grant
>>
>>
>> On Dec 3, 2008, at 6:46 PM, Richard Tomsett wrote:
>>
>>> Hi Philippe,
>>>
>>> I used K-Means on TF-IDF vectors and wondered the same thing about
>>> labelling the documents. I haven't got my code on me at the moment, and
>>> it was a few months ago that I last looked at it (so I was probably
>>> also using an older version of Mahout), but I seem to remember that I
>>> did just as you are suggesting and simply attached a unique ID to each
>>> document, which got passed through the map-reduce stages. This requires
>>> a bit of tinkering with the K-Means implementation but shouldn't be too
>>> much work.
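
A minimal sketch of the ID idea described above, written in plain Java rather than against Mahout or Hadoop. Every name here (LabeledKMeansSketch, LabeledVector, assign) is hypothetical and is not part of Mahout's API. It only illustrates the assignment step: the document ID travels with the vector, so the cluster output can list documents instead of bare vectors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not Mahout code: k-means assignment that keeps document IDs. */
final class LabeledKMeansSketch {

    /** A vector paired with the ID of the document it represents. */
    static final class LabeledVector {
        final String docId;
        final double[] values;

        LabeledVector(String docId, double[] values) {
            this.docId = docId;
            this.values = values;
        }
    }

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the index of the centroid closest to the given point. */
    static int nearestCentroid(double[] point, List<double[]> centroids) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.size(); c++) {
            double distance = squaredDistance(point, centroids.get(c));
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    /** Groups document IDs by nearest centroid (cluster index -> document IDs). */
    static Map<Integer, List<String>> assign(List<LabeledVector> docs, List<double[]> centroids) {
        Map<Integer, List<String>> clusters = new LinkedHashMap<>();
        for (LabeledVector doc : docs) {
            int cluster = nearestCentroid(doc.values, centroids);
            clusters.computeIfAbsent(cluster, k -> new ArrayList<>()).add(doc.docId);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<LabeledVector> docs = Arrays.asList(
            new LabeledVector("email-001", new double[] {0.9, 0.1}),
            new LabeledVector("email-002", new double[] {0.1, 0.8}),
            new LabeledVector("email-003", new double[] {1.0, 0.0}));
        List<double[]> centroids = Arrays.asList(
            new double[] {1.0, 0.0},
            new double[] {0.0, 1.0});
        // prints {0=[email-001, email-003], 1=[email-002]}
        System.out.println(assign(docs, centroids));
    }
}
```

In a map-reduce setting the same pairing would presumably be carried in the record key or value; how that fits the K-Means job of a given Mahout version is not something this sketch covers.
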
>>>
>>> As for having massive vectors, you could try representing them as
>>> sparse vectors rather than the dense vectors the standard Mahout
>>> K-Means algorithm accepts, which gets rid of all the zero values in the
>>> document vectors. See the Javadoc for details; it'll be more reliable
>>> than my memory :-)
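
To illustrate the sparse representation mentioned above, here is a rough sketch in plain Java. SparseVectorSketch is a hypothetical class, not Mahout's sparse vector implementation (see the Javadoc for that). It stores only the non-zero entries, so a document vector over a 45000-word dictionary costs memory proportional to the number of distinct words in that document.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch, not Mahout code: a vector that stores only non-zero entries. */
final class SparseVectorSketch {
    private final int cardinality;
    private final Map<Integer, Double> nonZero = new TreeMap<>();

    SparseVectorSketch(int cardinality) {
        this.cardinality = cardinality;
    }

    /** Sets one entry; writing zero removes the entry so it takes no space. */
    void set(int index, double value) {
        if (index < 0 || index >= cardinality) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        if (value == 0.0) {
            nonZero.remove(index);
        } else {
            nonZero.put(index, value);
        }
    }

    /** Returns the stored value, or zero for any index that was never set. */
    double get(int index) {
        Double value = nonZero.get(index);
        return value == null ? 0.0 : value;
    }

    /** Number of entries actually held in memory. */
    int storedEntries() {
        return nonZero.size();
    }

    /** Dot product that only visits the entries of the smaller vector. */
    double dot(SparseVectorSketch other) {
        if (other.cardinality != cardinality) {
            throw new IllegalArgumentException("cardinality mismatch");
        }
        boolean thisIsSmaller = nonZero.size() <= other.nonZero.size();
        Map<Integer, Double> small = thisIsSmaller ? nonZero : other.nonZero;
        Map<Integer, Double> large = thisIsSmaller ? other.nonZero : nonZero;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> entry : small.entrySet()) {
            Double match = large.get(entry.getKey());
            if (match != null) {
                sum += entry.getValue() * match;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        SparseVectorSketch a = new SparseVectorSketch(45000);
        a.set(12, 0.5);
        a.set(40000, 2.0);
        SparseVectorSketch b = new SparseVectorSketch(45000);
        b.set(7, 1.0);
        b.set(40000, 3.0);
        System.out.println(a.storedEntries()); // 2 stored values instead of 45000
        System.out.println(a.dot(b));          // 6.0, from the shared index 40000
    }
}
```
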
>>>
>>> Richard
>>>
>>>
>>> 2008/12/3 Philippe Lamarche <philippe.lamarche@gmail.com>
>>>
>>>> Hi,
>>>>
>>>> I have a question concerning text clustering and the current
>>>> K-Means/vectors implementation.
>>>>
>>>> For a school project, I did some text clustering with a subset of the
>>>> Enron corpus. I implemented a small M/R package that transforms text
>>>> into TF-IDF vector space, and then I used a slightly modified version
>>>> of the syntheticcontrol K-Means example. So far, all is fine.
>>>>
>>>> However, the output of the k-means algorithm is a vector, as is the
>>>> input. As I understand it, when text is transformed into vector space,
>>>> the cardinality of the vector is the number of words in your global
>>>> dictionary, that is, all words in all texts being clustered. This can
>>>> grow pretty quickly. For example, with only 27000 Enron emails, even
>>>> when removing words that appear in 2 emails or fewer, the dictionary
>>>> size is about 45000 words.
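
The dictionary pruning and weighting described above can be sketched roughly as follows. This is plain single-machine Java, not the M/R package from the thread, and all names (TfIdfSketch, buildDictionary, tfIdf) are hypothetical. It assumes one common TF-IDF variant, raw term count times ln(N / df); the thread does not say which variant was used. Dropping words that appear in 2 documents or fewer corresponds to a minimum document frequency of 3.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch, not the code from the thread: TF-IDF with dictionary pruning. */
final class TfIdfSketch {

    /** Counts, for each term, the number of documents that contain it at least once. */
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Maps each term with document frequency >= minDf to a vector index. */
    static Map<String, Integer> buildDictionary(Map<String, Integer> df, int minDf) {
        Map<String, Integer> dictionary = new TreeMap<>();
        int nextIndex = 0;
        // Copy into a TreeMap so index assignment follows alphabetical term order.
        for (Map.Entry<String, Integer> entry : new TreeMap<>(df).entrySet()) {
            if (entry.getValue() >= minDf) {
                dictionary.put(entry.getKey(), nextIndex++);
            }
        }
        return dictionary;
    }

    /** Weights for one document (vector index -> count * ln(numDocs / df)). */
    static Map<Integer, Double> tfIdf(List<String> doc, Map<String, Integer> dictionary,
                                      Map<String, Integer> df, int numDocs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String term : doc) {
            if (dictionary.containsKey(term)) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        Map<Integer, Double> weights = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            double idf = Math.log((double) numDocs / df.get(entry.getKey()));
            weights.put(dictionary.get(entry.getKey()), entry.getValue() * idf);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("enron", "gas", "gas"),
            Arrays.asList("enron", "power"),
            Arrays.asList("enron", "gas"));
        Map<String, Integer> df = documentFrequencies(docs);
        // minDf = 2 for this toy corpus; the thread's pruning corresponds to minDf = 3.
        Map<String, Integer> dictionary = buildDictionary(df, 2);
        System.out.println(dictionary); // {enron=0, gas=1}
        System.out.println(tfIdf(docs.get(0), dictionary, df, docs.size()));
    }
}
```

The vector cardinality equals the dictionary size, so raising the minimum document frequency is one direct way to keep the vectors smaller.
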
>>>>
>>>> My number one problem is this: how can we find out what document a
>>>> vector is representing when it comes out of the k-means algorithm? My
>>>> favorite solution would be to have a unique ID attached to each
>>>> vector. Is there such an ID in the vector implementation? Is there a
>>>> better solution? Is my approach to text clustering wrong?
>>>>
>>>> Thanks for the help,
>>>>
>>>> Philippe.
>>>>
>>>>
>> --------------------------
>> Grant Ingersoll
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ











