mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Document Clustering
Date Wed, 08 Jul 2009 12:07:19 GMT

On Jul 7, 2009, at 6:10 PM, Ted Dunning wrote:

> Paul,
>
> You are doing a fine job so far.  Typically, I would recommend you stick
> with the Document x Term matrix formulation.  That form would allow you
> to use kmeans almost directly.
>
> The good news for you is that Grant is working on an extended example to
> do just what you are looking to do.  I expect he will comment shortly.

On vacation...  I will try to find some time soon.
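
In the meantime, here is a minimal sketch (plain Java, illustrative names,
not Mahout's actual vectorization code) of the Document x Term formulation
Ted mentions: each document becomes a sparse vector of term counts over a
shared dictionary, and those vectors are what kmeans (or canopy followed by
kmeans) would consume.

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class TermVectors {

  // Assigns each distinct term a column index in the Document x Term matrix.
  private final Map<String, Integer> dictionary = new LinkedHashMap<String, Integer>();

  // Turns one document's text into a sparse row: term index -> raw count.
  public Map<Integer, Double> vectorize(String text) {
    Map<Integer, Double> row = new HashMap<Integer, Double>();
    for (String token : text.toLowerCase().split("\\W+")) {
      if (token.length() == 0) {
        continue;
      }
      Integer idx = dictionary.get(token);
      if (idx == null) {
        idx = Integer.valueOf(dictionary.size());
        dictionary.put(token, idx);
      }
      Double count = row.get(idx);
      row.put(idx, count == null ? 1.0 : count + 1.0);
    }
    return row;
  }
}

In practice you would want TF-IDF weighting rather than raw counts, but the
shape of the data is the same.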

>
> On Tue, Jul 7, 2009 at 2:37 PM, Paul Ingles <paul@oobaloo.co.uk> wrote:
>
>> Hi,
>>
>> I've been doing some reading through the archives to search for some
>> inspiration with a problem I've been attempting to solve at work, and
>> was hoping I could share where my head's at and get some pointers for
>> where to go. Since we're looking at clustering between 17m and 70m
>> documents, we're looking to implement this in Hadoop.
>>
>> We're trying to build clusters of (relatively small) documents, based on
>> the words/terms they contain. Clusters will probably range in size
>> between 10 and 1000 documents. Clusters should ultimately be populated
>> by documents that contain almost identical terms (nothing clever like
>> stemming done so far).
>>
>> So far, I've been working down the pairwise similarity route. So, using
>> a few MapReduce jobs we produce something along the lines of the
>> following:
>>
>> Da: [Db,0.1] [Dc,0.4]
>> Db: [Dc,0.5] [Df,0.9]
>> ...
>> Dj:
>>
>> With a row per document, containing a vector of (related document,
>> similarity score) tuples. Viewed another way, it's the typical matrix:
>>
>>       |   A   |   B   |   C
>> ------+-------+-------+------
>>   A   |       |  0.1  |  0.4
>>   B   |       |       |  0.5
>>
>> etc. A higher number means a more closely related document.
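
As an aside: one common way to produce a score like this is cosine
similarity between the two documents' term vectors, with higher meaning more
closely related. A minimal sketch in plain Java, purely illustrative (the
original jobs may compute something different):

import java.util.Map;

public class Cosine {

  // Cosine similarity between two sparse term vectors (term index -> weight);
  // returns a value in [0, 1] when all weights are non-negative.
  public static double similarity(Map<Integer, Double> a, Map<Integer, Double> b) {
    double dot = 0.0;
    for (Map.Entry<Integer, Double> e : a.entrySet()) {
      Double other = b.get(e.getKey());
      if (other != null) {
        dot += e.getValue().doubleValue() * other.doubleValue();
      }
    }
    double na = norm(a);
    double nb = norm(b);
    return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
  }

  private static double norm(Map<Integer, Double> v) {
    double sum = 0.0;
    for (Double x : v.values()) {
      sum += x.doubleValue() * x.doubleValue();
    }
    return Math.sqrt(sum);
  }
}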
>>
>> I've been trying to get my head around how to cluster these in a set of
>> MapReduce jobs and I'm not quite sure how to proceed: the examples I've
>> read around kmeans, canopy clustering etc. all seem to work on
>> multidimensional (numerical) data. Given the data above, is it even
>> possible to adapt the algorithm? The potential centroids in the example
>> above would be the documents, and I just can't get my head around
>> applying the algorithms to this kind of data.
>>
>> I guess the alternative would be to step back to producing a matrix of
>> documents x terms:
>>
>>       |  Ta   |  Tb   |  Tc
>> ------+-------+-------+------
>>  Da   |       |  0.1  |  0.4
>>  Db   |       |       |  0.5
>>
>> And then cluster based on this? This seems similar in structure to the
>> user x movie recommendation matrix that's often used?
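
That is essentially the formulation Ted recommends above. To make it
concrete, a small illustration (plain Java, not Mahout's KMeansDriver) of
why those rows feed kmeans directly: each document row is a point, each
cluster centroid is a dense vector over the same term indices, and the
assignment step is just nearest centroid.

import java.util.List;
import java.util.Map;

public class NearestCentroid {

  // Returns the index of the centroid closest, by squared Euclidean distance,
  // to a sparse document vector (term index -> weight).
  public static int assign(Map<Integer, Double> doc, List<double[]> centroids) {
    int best = -1;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centroids.size(); c++) {
      double[] centroid = centroids.get(c);
      // Start from the centroid's squared norm (as if the document were all
      // zeros), then correct each non-zero document term: the adjustment
      // x*x - 2*x*cv turns the cv^2 already counted into (x - cv)^2.
      double dist = 0.0;
      for (int j = 0; j < centroid.length; j++) {
        dist += centroid[j] * centroid[j];
      }
      for (Map.Entry<Integer, Double> e : doc.entrySet()) {
        double x = e.getValue().doubleValue();
        int j = e.getKey().intValue();
        double cv = j < centroid.length ? centroid[j] : 0.0;
        dist += x * x - 2.0 * x * cv;
      }
      if (dist < bestDist) {
        bestDist = dist;
        best = c;
      }
    }
    return best;
  }
}

The centroid update is then the per-cluster mean of the assigned vectors; on
Hadoop the assignment fits the map phase and the update fits the reduce
phase of one job per iteration, which is broadly how Mahout's kmeans job is
structured.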
>>
>> I'd really appreciate people's thoughts: am I thinking the right way?
>> Is there a better way?
>>
>> Thanks,
>> Paul
>>
>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve
>
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> http://www.deepdyve.com
> 858-414-0013 (m)
> 408-773-0220 (fax)

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

