mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Document Clustering
Date Thu, 28 May 2009 15:55:30 GMT
Generally the first step for document clustering is to compute all
non-trivial document-document similarities.  A good way to do that is to
strip out kill words (stop words) from all documents and then do a
document-level co-occurrence.  In database terms, if we think of documents
as (docid, term) pairs, this step consists of joining the document table to
itself on term to get document-document pairs for all documents that share
terms.  In detail, starting with a term weight table and a document table:

     - join term weight to document table to get (docid, term, weight)*

     - optionally normalize term weights per document by summing weights or
squared weights by docid and joining back to the weighted document table.

     - join the result to itself on term, drop the term, and reduce on the
(docid1, docid2) pair to sum weights.  This gives (docid1, docid2,
sum_of_weights, number_of_occurrences).  The sum can be of weights or of
squared weights.  Accumulating the number of co-occurrences helps in
computing the average.
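
The three steps above can be sketched in memory as follows.  This is only
an illustration, not Mahout code: the function names and the toy data are
made up, kill words are assumed to be simply absent from the term weight
table, each (docid, term) pair is assumed to appear once, and the
per-pair score is assumed to be the product of the two documents' weights
for each shared term (which gives cosine similarity when the optional
normalization uses squared weights).

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def weighted_docs(doc_terms, term_weights):
    """Step 1: join (docid, term) pairs to the term weight table.

    Terms missing from term_weights (e.g. kill words) are dropped.
    """
    return [(d, t, term_weights[t]) for d, t in doc_terms if t in term_weights]


def normalize(rows):
    """Step 2 (optional): scale each document's weights to unit L2 norm."""
    squared = defaultdict(float)
    for d, _, w in rows:
        squared[d] += w * w
    return [(d, t, w / sqrt(squared[d])) for d, t, w in rows if squared[d] > 0]


def cooccurrence(rows):
    """Step 3: self-join on term, drop the term, reduce on the docid pair.

    Returns {(docid1, docid2): (sum_of_weights, number_of_occurrences)}
    with docid1 < docid2, so each unordered pair appears once.
    """
    by_term = defaultdict(list)
    for d, t, w in rows:
        by_term[t].append((d, w))

    sums = defaultdict(float)
    counts = defaultdict(int)
    for postings in by_term.values():
        for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
            sums[(d1, d2)] += w1 * w2  # assumed score: product of weights
            counts[(d1, d2)] += 1
    return {pair: (sums[pair], counts[pair]) for pair in sums}


# Hypothetical toy input: three documents and three weighted terms.
doc_terms = [("a", "cat"), ("a", "dog"), ("b", "cat"),
             ("b", "fish"), ("c", "dog"), ("c", "cat")]
term_weights = {"cat": 1.0, "dog": 2.0, "fish": 3.0}

rows = weighted_docs(doc_terms, term_weights)
raw_similarity = cooccurrence(rows)
unit_similarity = cooccurrence(normalize(rows))
```

Only pairs that share at least one term get an entry, which is what keeps
the result sparse.  In a database or M/R setting the per-term inner loop
corresponds to the self-join, and very common terms make it expensive,
which is one reason to strip kill words first.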


From here, there are a number of places to go, but the result we have here
is essentially a sparse similarity matrix.  If the term weights were
normalized per document, then document similarity can be converted to
distance trivially.
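
As one example of that conversion, assuming the documents were scaled to
unit L2 norm and the similarity is the dot product of the two weight
vectors, the Euclidean distance follows from
||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y = 2 - 2 x.y.  The function name
below is hypothetical, and the identity does not hold for other
normalizations such as dividing by the plain sum of weights.

```python
from math import sqrt


def cosine_to_euclidean(similarity):
    """Euclidean distance between two unit-length vectors.

    similarity is the dot product (cosine) of the two vectors.
    """
    # max() guards against tiny negative values caused by rounding.
    return sqrt(max(0.0, 2.0 - 2.0 * similarity))
```

With non-negative weights, a document pair that is absent from the sparse
matrix shares no terms, so its similarity is 0 and its distance under this
conversion is sqrt(2).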

On Thu, May 28, 2009 at 8:28 AM, Grant Ingersoll <gsingers@apache.org> wrote:

> It sounds like a start.  Can you open a JIRA and attach a patch?   I still
> am not sure if Lucene is totally the way to go on it.  I suppose eventually
> we need a way to put things in a common format like ARFF and then just have
> transformers to it from other formats.  Come to think of it, maybe it makes
> sense to have a Tika ContentHandler that can output ARFF or whatever other
> format we want.  This would make translating input docs dead simple.
>
> Then again, maybe a real Pipeline is the answer.  I know Solr, etc. could
> benefit from one too, but that is a whole different ball of wax.
>
>
>
> On May 28, 2009, at 10:32 AM, Shashikant Kore wrote:
>
>> Hi Grant,
>>
>> I have the code to create lucene index from document text and then
>> generate document vectors from it.  This is stand-alone code and not
>> MR.  Is it something that interests you?
>>
>> --shashi
>>
>> On Thu, May 28, 2009 at 5:57 PM, Grant Ingersoll <gsingers@apache.org>
>> wrote:
>>
>>> I'm about to write some code to prepare docs for clustering and I know at
>>> least a few others on the list here have done the same.  I was wondering
>>> if anyone is in the position to share their code and contribute to Mahout.
>>>
>>> As I see it, we need to be able to take in text and create the matrix of
>>> terms, where each cell is the TF/IDF (or some other weight, would be nice
>>> to be pluggable) and then normalize the vector (and, according to Ted, we
>>> should support using different norms).   Seems like we also need the
>>> label stuff in place (https://issues.apache.org/jira/browse/MAHOUT-65)
>>> but I'm not sure on the state of that patch.
>>>
>>> As for the TF/IDF stuff, we sort of have it via the BayesTfIdfDriver, but
>>> it needs to be more generic.  I realize we could use Lucene, but having a
>>> solution that scales w/ Lucene is going to take work, AIUI, whereas a M/R
>>> job seems more straightforward.
>>>
>>> I'd like to be able to get this stuff committed relatively soon and have
>>> the examples for other people.  My shorter term goal is I'm working on
>>> some demos using Wikipedia.
>>>
>>> Thanks,
>>> Grant
>>>
>>>
>>>
>>>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
