mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Document Clustering
Date Thu, 28 May 2009 15:28:32 GMT
It sounds like a start.  Can you open a JIRA and attach a patch?   I  
still am not sure if Lucene is totally the way to go on it.  I suppose  
eventually we need a way to put things in a common format like ARFF  
and then just have transformers to it from other formats.  Come to  
think of it, maybe it makes sense to have a Tika ContentHandler that  
can output ARFF or whatever other format we want.  This would make  
translating input docs dead simple.
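
To make the "common format" idea concrete, here is a rough sketch of writing
term-weight vectors out as sparse ARFF. It is not Tika or Mahout code: the
function name to_sparse_arff and its arguments are made up for illustration,
and it assumes each term is a single token that needs no ARFF quoting.

```python
def to_sparse_arff(relation, terms, vectors):
    """Render term-weight vectors as a sparse ARFF document.

    relation: name for the @relation header (hypothetical, caller-chosen).
    terms:    list of attribute names, one per column; assumed to contain
              no spaces or special characters, so no quoting is done.
    vectors:  list of dicts mapping column index -> weight.
    """
    lines = ["@relation " + relation, ""]
    for term in terms:
        lines.append("@attribute " + term + " numeric")
    lines.append("")
    lines.append("@data")
    for vec in vectors:
        # Sparse ARFF rows list only non-zero cells as "index value" pairs,
        # with 0-based indices in ascending order.
        cells = ", ".join(f"{i} {vec[i]}" for i in sorted(vec) if vec[i] != 0)
        lines.append("{" + cells + "}")
    return "\n".join(lines) + "\n"
```

A transformer from any other input format would then only need to produce the
terms list and the per-document dicts.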

Then again, maybe a real Pipeline is the answer.  I know Solr, etc.  
could benefit from one too, but that is a whole different ball of wax.


On May 28, 2009, at 10:32 AM, Shashikant Kore wrote:

> Hi Grant,
>
> I have the code to create lucene index from document text and then
> generate document vectors from it.  This is stand-alone code and not
> MR.  Is it something that interests you?
>
> --shashi
>
> On Thu, May 28, 2009 at 5:57 PM, Grant Ingersoll  
> <gsingers@apache.org> wrote:
>> I'm about to write some code to prepare docs for clustering and I  
>> know at
>> least a few others on the list here have done the same.  I was  
>> wondering if
>> anyone is in the position to share their code and contribute to  
>> Mahout.
>>
>> As I see it, we need to be able to take in text and create the  
>> matrix of
>> terms, where each cell is the TF/IDF (or some other weight, would  
>> be nice to
>> be pluggable) and then normalize the vector (and, according to Ted,  
>> we
>> should support using different norms).   Seems like we also need  
>> the label
>> stuff in place (https://issues.apache.org/jira/browse/MAHOUT-65)  
>> but I'm not
>> sure on the state of that patch.
>>
>> As for the TF/IDF stuff, we sort of have it via the  
>> BayesTfIdfDriver, but it
>> needs to be more generic.  I realize we could use Lucene, but  
>> having a
>> solution that scales w/ Lucene is going to take work, AIUI, whereas  
>> a M/R
>> job seems more straightforward.
>>
>> I'd like to be able to get this stuff committed relatively soon and  
>> have the
>> examples for other people.  My shorter term goal is I'm working on  
>> some
>> demos using Wikipedia.
>>
>> Thanks,
>> Grant
>>
>>
>>
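
As a small, single-machine illustration of the vectorization described in the
quoted message (term weights per cell, a pluggable weighting function, and a
pluggable norm), something like the following could serve as a starting point.
It is only a sketch, not the BayesTfIdfDriver and not an M/R job; the names
tfidf, p_norm and vectorize are hypothetical, tokenization is a plain
whitespace split, and the weight shown is one common TF/IDF variant
(tf * log(N / df)).

```python
import math
from collections import Counter

def tfidf(tf, df, n_docs):
    """One common TF/IDF variant: raw term frequency times log(N / df)."""
    return tf * math.log(n_docs / df)

def p_norm(p):
    """Return a function computing the L-p norm of a sequence of weights."""
    def norm(values):
        return sum(abs(v) ** p for v in values) ** (1.0 / p)
    return norm

def vectorize(docs, weight=tfidf, norm=p_norm(2)):
    """Turn raw text into normalized sparse term-weight vectors.

    weight and norm are the pluggable pieces: any function with the same
    signature as tfidf, and any function mapping weights to a length.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    terms = sorted(df)
    index = {t: i for i, t in enumerate(terms)}
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {index[t]: weight(c, df[t], n_docs) for t, c in counts.items()}
        length = norm(vec.values())
        if length > 0:  # leave all-zero vectors untouched
            vec = {i: w / length for i, w in vec.items()}
        vectors.append(vec)
    return terms, vectors
```

Swapping norms is then a one-argument change, e.g.
vectorize(docs, norm=p_norm(1)) for L1 instead of the default L2.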

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

