mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Document Clustering
Date Thu, 28 May 2009 13:07:35 GMT
I did an initial cut at MAHOUT-65 but was blocked by the 
serialization/deserialization needed to fully address the requirements. 
Now that Gson is in lib it makes sense to use it. The Dirichlet package 
already has a JsonVectorAdapter which could be rewritten. It is a big 
change to the on-disk format, but most jobs have an initial step to 
consume, e.g., CSV files, so changing it should not break much 
compatibility. I will take another crack at it.

Jeff

Grant Ingersoll wrote:
> I'm about to write some code to prepare docs for clustering and I know 
> at least a few others on the list here have done the same.  I was 
> wondering if anyone is in the position to share their code and 
> contribute to Mahout.
>
> As I see it, we need to be able to take in text and create the matrix 
> of terms, where each cell is the TF/IDF (or some other weight, would 
> be nice to be pluggable) and then normalize the vector (and, according 
> to Ted, we should support using different norms).   Seems like we also 
> need the label stuff in place 
> (https://issues.apache.org/jira/browse/MAHOUT-65) but I'm not sure on 
> the state of that patch.
>
> As for the TF/IDF stuff, we sort of have it via the BayesTfIdfDriver, 
> but it needs to be more generic.  I realize we could use Lucene, but 
> having a solution that scales w/ Lucene is going to take work, AIUI, 
> whereas a M/R job seems more straightforward.
>
> I'd like to be able to get this stuff committed relatively soon and 
> have the examples for other people.  My shorter term goal is I'm 
> working on some demos using Wikipedia.
>
> Thanks,
> Grant
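
For illustration, the preparation step Grant describes (a term matrix with a
pluggable cell weight, followed by normalization under a selectable norm) could
be sketched as below. This is a minimal single-machine sketch in Python, not a
M/R job and not Mahout code: the names tf_idf, normalize and vectorize are
hypothetical, tokenization is assumed to be whitespace splitting, and vectors
are assumed to be sparse dicts keyed by term.

```python
import math
from collections import Counter


def tf_idf(tf, df, n_docs):
    """Default weight (assumed form): term frequency times log(N / df)."""
    return tf * math.log(n_docs / df)


def normalize(vec, p=2.0):
    """Scale a sparse vector (dict) to unit length under the p-norm.

    p may be math.inf for the max norm. An all-zero vector is returned as is.
    """
    if not vec:
        return {}
    if math.isinf(p):
        norm = max(abs(v) for v in vec.values())
    else:
        norm = sum(abs(v) ** p for v in vec.values()) ** (1.0 / p)
    if norm == 0:
        return dict(vec)
    return {term: v / norm for term, v in vec.items()}


def vectorize(docs, weight=tf_idf, p=2.0):
    """Turn raw text documents into normalized sparse term vectors.

    weight is the pluggable cell weight: any callable (tf, df, n_docs) -> float.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(tokenized)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {term: weight(count, df[term], n_docs) for term, count in tf.items()}
        vectors.append(normalize(vec, p))
    return vectors
```

Swapping the weight or the norm is then a matter of passing different
arguments, e.g. vectorize(docs, weight=lambda tf, df, n: tf, p=1) for raw term
frequencies scaled to sum to one.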

