mahout-user mailing list archives

From Grant Ingersoll <>
Subject Re: Document Clustering
Date Sat, 13 Jun 2009 12:43:44 GMT
Hi Shashi,

Was wondering what you thought of my updates to MAHOUT-126?


On May 28, 2009, at 10:32 AM, Shashikant Kore wrote:

> Hi Grant,
> I have code to create a Lucene index from document text and then
> generate document vectors from it.  This is stand-alone code, not
> MR.  Is it something that interests you?
> --shashi
> On Thu, May 28, 2009 at 5:57 PM, Grant Ingersoll <> wrote:
>> I'm about to write some code to prepare docs for clustering and I know
>> at least a few others on the list here have done the same.  I was
>> wondering if anyone is in a position to share their code and
>> contribute to Mahout.
>>
>> As I see it, we need to be able to take in text and create the matrix
>> of terms, where each cell is the TF/IDF (or some other weight; it would
>> be nice for this to be pluggable), and then normalize the vector (and,
>> according to Ted, we should support using different norms).  Seems
>> like we also need the label stuff in place (  but I'm not sure on the
>> state of that patch.
>>
>> As for the TF/IDF stuff, we sort of have it via the BayesTfIdfDriver,
>> but it needs to be more generic.  I realize we could use Lucene, but
>> having a solution that scales w/ Lucene is going to take work, AIUI,
>> whereas a M/R job seems more straightforward.
>>
>> I'd like to be able to get this stuff committed relatively soon and
>> have the examples for other people.  In the shorter term, I'm working
>> on some demos using Wikipedia.
>> Thanks,
>> Grant
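
For readers following the thread, the term matrix described in the quoted
message (one weighted cell per term per document, a pluggable weight, and
a choice of norm) can be sketched roughly as below. This is a minimal,
stand-alone illustration and not Mahout code: the class TfIdfSketch, the
Weight interface, and the methods vectorize and normalize are hypothetical
names, the tokenizer is a naive split on non-word characters, and
tf * ln(N / df) is only one common TF/IDF variant. It is neither an M/R
job nor Lucene-based.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; none of these names are Mahout classes. */
class TfIdfSketch {

    /** Pluggable weight: tf = count in the doc, df = docs containing the term, n = corpus size. */
    interface Weight {
        double weigh(int tf, int df, int n);
    }

    /** One common TF/IDF variant (an assumption here): tf * ln(n / df). */
    static final Weight TF_IDF = (tf, df, n) -> tf * Math.log((double) n / df);

    /** Builds one sparse term -> weight vector per document, each scaled to unit p-norm. */
    static List<Map<String, Double>> vectorize(List<String> docs, Weight weight, double p) {
        List<Map<String, Integer>> counts = new ArrayList<>();
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            // Term frequencies for this document (naive tokenizer).
            Map<String, Integer> tf = new TreeMap<>();
            for (String token : doc.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    tf.merge(token, 1, Integer::sum);
                }
            }
            // Each distinct term adds one to its document frequency.
            for (String term : tf.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
            counts.add(tf);
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> tf : counts) {
            Map<String, Double> v = new TreeMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                v.put(e.getKey(), weight.weigh(e.getValue(), df.get(e.getKey()), docs.size()));
            }
            vectors.add(normalize(v, p));
        }
        return vectors;
    }

    /** Divides every weight by the vector's p-norm (p = 1, 2, ...); all-zero vectors are returned unchanged. */
    static Map<String, Double> normalize(Map<String, Double> v, double p) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += Math.pow(Math.abs(x), p);
        }
        double norm = Math.pow(sum, 1.0 / p);
        if (norm == 0.0) {
            return v;
        }
        Map<String, Double> out = new TreeMap<>();
        for (Map.Entry<String, Double> e : v.entrySet()) {
            out.put(e.getKey(), e.getValue() / norm);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "apache mahout clustering",
                "apache lucene index",
                "wikipedia clustering demo");
        for (Map<String, Double> v : vectorize(docs, TF_IDF, 2.0)) {
            System.out.println(v);
        }
    }
}
```

Passing a different Weight lambda swaps the weighting scheme, and changing
p switches the norm (1.0 for L1, 2.0 for L2), which is the kind of
pluggability the quoted message asks for.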
