mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Document Clustering
Date Thu, 28 May 2009 19:21:36 GMT
Isn't this what Mahout's clustering stuff will do?  In other words, if I
calculate the vector for each document (presumably removing stopwords),
normalize it, where each cell is the weight (presumably TF/IDF), and then
put that into a matrix (keeping track of labels), I should then be able to
just run any of Mahout's clustering jobs on that matrix using the
appropriate DistanceMeasure implementation, right?  Or am I missing
something?
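
The preparation steps described above (drop stopwords, weight each cell by
TF/IDF, normalize each document vector, keep the labels) can be sketched
roughly as follows. This is a stand-alone Python illustration, not Mahout
code: the stopword list, the whitespace tokenizer, the log(N/df) form of
IDF, and the names tfidf_vectors and cosine_distance are all assumptions
made for the example.

```python
import math
from collections import Counter

# Illustrative stopword list only; a real one would be much longer.
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}

def tfidf_vectors(docs):
    """Map {label: text} to {label: {term: weight}}, L2-normalized TF-IDF."""
    tokens = {label: [t for t in text.lower().split() if t not in STOPWORDS]
              for label, text in docs.items()}
    n_docs = len(tokens)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for terms in tokens.values() for term in set(terms))
    vectors = {}
    for label, terms in tokens.items():
        tf = Counter(terms)
        vec = {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Leave an all-zero vector untouched to avoid dividing by zero.
        vectors[label] = {t: w / norm for t, w in vec.items()} if norm else vec
    return vectors

def cosine_distance(u, v):
    """1 - dot product; equals cosine distance for unit-length sparse vectors."""
    return 1.0 - sum(w * v.get(t, 0.0) for t, w in u.items())

docs = {
    "d1": "the cat sat in the hat",
    "d2": "a cat and a hat",
    "d3": "stock markets fell sharply",
}
vectors = tfidf_vectors(docs)
```

Because the vectors are sparse dicts keyed by label, the labels travel with
the rows, and any distance function over two such dicts could stand in for
cosine_distance.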

On May 28, 2009, at 11:55 AM, Ted Dunning wrote:

> Generally the first step for document clustering is to compute all
> non-trivial document-document similarities.  A good way to do that is to
> strip out kill words from all documents and then do a document-level
> cross-occurrence.  In database terms, if we think of documents as
> (docid, term) pairs, this step consists of joining this document table
> to itself to get document-document pairs for all documents that share
> terms.  In detail, starting with a term weight table and a document
> table:
>
>     - join term weight to document table to get (docid, term, weight)*
>
>     - optionally normalize term weights per document by summing weights
>       or squared weights by docid and joining back to the weighted
>       document table.
>
>     - join result to itself, dropping terms and reducing on docid to sum
>       weights.  This gives (docid1, docid2, sum_of_weights,
>       number_of_occurrences).  This sum can be weights or squared
>       weights.  Accumulating the number of co-occurrences helps in
>       computing the average.
>
>
> From here, there are a number of places to go, but the result we have
> here is essentially a sparse similarity matrix.  If you have document
> normalization, then document similarity can be converted to distance
> trivially.
>
> On Thu, May 28, 2009 at 8:28 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>
>> It sounds like a start.  Can you open a JIRA and attach a patch?  I
>> still am not sure if Lucene is totally the way to go on it.  I suppose
>> eventually we need a way to put things in a common format like ARFF and
>> then just have transformers to it from other formats.  Come to think of
>> it, maybe it makes sense to have a Tika ContentHandler that can output
>> ARFF or whatever other format we want.  This would make translating
>> input docs dead simple.
>>
>> Then again, maybe a real Pipeline is the answer.  I know Solr, etc.
>> could benefit from one too, but that is a whole different ball of wax.
>>
>>
>>
>> On May 28, 2009, at 10:32 AM, Shashikant Kore wrote:
>>
>>> Hi Grant,
>>>
>>> I have the code to create a Lucene index from document text and then
>>> generate document vectors from it.  This is stand-alone code and not
>>> MR.  Is it something that interests you?
>>>
>>> --shashi
>>>
>>> On Thu, May 28, 2009 at 5:57 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>>>
>>>> I'm about to write some code to prepare docs for clustering and I
>>>> know at least a few others on the list here have done the same.  I
>>>> was wondering if anyone is in the position to share their code and
>>>> contribute to Mahout.
>>>>
>>>> As I see it, we need to be able to take in text and create the
>>>> matrix of terms, where each cell is the TF/IDF (or some other weight,
>>>> would be nice to be pluggable) and then normalize the vector (and,
>>>> according to Ted, we should support using different norms).  Seems
>>>> like we also need the label stuff in place
>>>> (https://issues.apache.org/jira/browse/MAHOUT-65) but I'm not sure on
>>>> the state of that patch.
>>>>
>>>> As for the TF/IDF stuff, we sort of have it via the
>>>> BayesTfIdfDriver, but it needs to be more generic.  I realize we
>>>> could use Lucene, but having a solution that scales w/ Lucene is
>>>> going to take work, AIUI, whereas a M/R job seems more
>>>> straightforward.
>>>>
>>>> I'd like to be able to get this stuff committed relatively soon and
>>>> have the examples for other people.  My shorter term goal is I'm
>>>> working on some demos using Wikipedia.
>>>>
>>>> Thanks,
>>>> Grant
>>>>
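
The join-based similarity computation in Ted's quoted message can be
sketched in a few lines of stand-alone Python. This is only an illustration
under stated assumptions, not Mahout or M/R code: the input rows are
(docid, term, weight) tuples with each (docid, term) pair appearing once,
weights are already normalized per document by squared weights, and the
per-term contribution to a document pair is taken to be the product of the
two weights, so the accumulated sum is the dot product. The function name
cooccurrence_join and the sample rows are made up for the example.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_join(rows):
    """Self-join (docid, term, weight) rows on term.

    Returns {(docid1, docid2): (sum_of_weights, number_of_occurrences)},
    with docid1 < docid2, only for pairs sharing at least one term.
    """
    # Group rows by term; this plays the role of the join key.
    by_term = defaultdict(list)
    for docid, term, weight in rows:
        by_term[term].append((docid, weight))

    # Drop the term and reduce on the (docid1, docid2) pair.
    pairs = defaultdict(lambda: [0.0, 0])
    for postings in by_term.values():
        for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
            acc = pairs[(d1, d2)]
            acc[0] += w1 * w2   # assumed per-term contribution
            acc[1] += 1         # number of shared terms
    return {pair: (total, count) for pair, (total, count) in pairs.items()}

# Hypothetical unit-length document vectors, flattened to rows.
rows = [
    ("d1", "cat", 0.6), ("d1", "hat", 0.8),
    ("d2", "cat", 0.6), ("d2", "hat", 0.8),
    ("d3", "hat", 0.6), ("d3", "dog", 0.8),
    ("d4", "fish", 1.0),
]
similarity = cooccurrence_join(rows)

# With unit-length vectors, distance follows directly from similarity.
distance = {pair: 1.0 - total for pair, (total, _) in similarity.items()}
```

Pairs that share no terms, such as anything involving d4 here, never appear
in the result, which is what makes the similarity matrix sparse.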
