mahout-user mailing list archives

From Shashikant Kore <shashik...@gmail.com>
Subject Re: Document Clustering
Date Thu, 28 May 2009 14:32:55 GMT
Hi Grant,

I have code that creates a Lucene index from document text and then
generates document vectors from it.  It is stand-alone code, not
MapReduce (MR).  Is that something that interests you?

--shashi

On Thu, May 28, 2009 at 5:57 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> I'm about to write some code to prepare docs for clustering and I know at
> least a few others on the list here have done the same.  I was wondering if
> anyone is in the position to share their code and contribute to Mahout.
>
> As I see it, we need to be able to take in text and create the matrix of
> terms, where each cell is the TF/IDF (or some other weight, would be nice to
> be pluggable) and then normalize the vector (and, according to Ted, we
> should support using different norms).   Seems like we also need the label
> stuff in place (https://issues.apache.org/jira/browse/MAHOUT-65) but I'm not
> sure on the state of that patch.
>
> As for the TF/IDF stuff, we sort of have it via the BayesTfIdfDriver, but it
> needs to be more generic.  I realize we could use Lucene, but having a
> solution that scales w/ Lucene is going to take work, AIUI, whereas a M/R
> job seems more straightforward.
>
> I'd like to be able to get this stuff committed relatively soon and have the
> examples for other people.  My shorter-term goal is to build some
> demos using Wikipedia.
>
> Thanks,
> Grant
>
>
>
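
To make the steps Grant describes concrete (term weights per document, a
pluggable weighting function, then normalization under a selectable norm),
here is a minimal stand-alone sketch in Python.  It is not Mahout code and
not the code offered above.  The names tf_idf, p_norm and vectorize are
hypothetical, whitespace tokenization is an assumption, and tf * log(N/df)
is only one of several common TF/IDF variants.

```python
import math
from collections import Counter


def tf_idf(tf, df, n_docs):
    # One common TF/IDF variant (assumption): raw term count times log(N / df).
    return tf * math.log(n_docs / df)


def p_norm(p):
    # Build an L-p norm function, e.g. p_norm(1) or p_norm(2).
    def norm(values):
        return sum(abs(v) ** p for v in values) ** (1.0 / p)
    return norm


def vectorize(docs, weight=tf_idf, norm=p_norm(2)):
    # Naive whitespace tokenization (assumption); a real analyzer would go here.
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: the number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Sparse row of the term matrix: term -> weight.
        vec = {t: weight(c, df[t], n_docs) for t, c in counts.items()}
        length = norm(vec.values())
        if length > 0:  # leave all-zero vectors untouched
            vec = {t: w / length for t, w in vec.items()}
        vectors.append(vec)
    return vectors
```

A different weight or norm is swapped in by passing another function, for
example vectorize(docs, norm=p_norm(1)) for L1 normalization.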
