mahout-user mailing list archives

From Suneel Marthi <suneel_mar...@yahoo.com>
Subject Re: Clustering a large crawl
Date Thu, 31 May 2012 23:20:20 GMT
Pat,

We have been trying to do something very similar to what you are trying to accomplish, and we
ended up with better clusters by considering only the top 1000 terms (by tf-idf weight) per doc
and using Tanimoto distance.
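A rough sketch of that pruning step, assuming the Mahout 0.x math API (Vector,
RandomAccessSparseVector, TanimotoDistanceMeasure); the helper name topNTerms and all the toy
numbers are just for illustration:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.mahout.common.distance.TanimotoDistanceMeasure;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class TopTermPruning {

      /** Keeps only the n highest-weighted (tf-idf) terms of a document vector. */
      static Vector topNTerms(Vector doc, int n) {
        // Copy (index, weight) pairs out; the iterator may reuse its Element instance.
        List<double[]> terms = new ArrayList<double[]>();
        Iterator<Vector.Element> it = doc.iterateNonZero();
        while (it.hasNext()) {
          Vector.Element e = it.next();
          terms.add(new double[] {e.index(), e.get()});
        }
        // Sort by weight, descending, and keep the top n.
        Collections.sort(terms, new Comparator<double[]>() {
          public int compare(double[] a, double[] b) {
            return Double.compare(b[1], a[1]);
          }
        });
        Vector pruned = new RandomAccessSparseVector(doc.size());
        for (int i = 0; i < Math.min(n, terms.size()); i++) {
          pruned.set((int) terms.get(i)[0], terms.get(i)[1]);
        }
        return pruned;
      }

      public static void main(String[] args) {
        // Tiny made-up vectors; in practice these are the tf-idf vectors from seq2sparse
        // and n would be 1000 rather than 2.
        Vector docA = new RandomAccessSparseVector(5);
        docA.set(0, 2.0); docA.set(1, 0.5); docA.set(3, 1.2);
        Vector docB = new RandomAccessSparseVector(5);
        docB.set(0, 1.8); docB.set(2, 0.7); docB.set(3, 0.9);

        TanimotoDistanceMeasure tanimoto = new TanimotoDistanceMeasure();
        System.out.println("Tanimoto distance on pruned vectors: "
            + tanimoto.distance(topNTerms(docA, 2), topNTerms(docB, 2)));
      }
    }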


Definitely give dimensionality reduction a try and let us know how it works out. 
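For whatever it's worth, a toy illustration of what the reduction step buys, in plain Java.
This is a generic random projection, not Mahout's Lanczos or stochastic SVD jobs, and every
number in it is made up:

    import java.util.Random;

    /**
     * Toy random projection: maps a sparse ~100,000-term tf-idf vector onto k dense
     * dimensions, which is the kind of shrinking dimensionality reduction provides.
     */
    public class RandomProjectionSketch {

      public static void main(String[] args) {
        int k = 200;  // reduced dimensionality (illustrative)

        // One sparse document: term index -> tf-idf weight (made-up values).
        int[] termIds = {17, 4023, 99871};
        double[] weights = {1.3, 0.4, 2.1};

        // Each dictionary term gets a deterministic random row of the projection matrix,
        // seeded by its index, so the full dictionary x k matrix is never materialized.
        double[] reduced = new double[k];
        for (int t = 0; t < termIds.length; t++) {
          Random row = new Random(termIds[t]);
          for (int j = 0; j < k; j++) {
            reduced[j] += weights[t] * row.nextGaussian() / Math.sqrt(k);
          }
        }

        System.out.println("First reduced components: "
            + reduced[0] + ", " + reduced[1] + ", " + reduced[2]);
      }
    }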



________________________________
 From: Pat Ferrel <pat@occamsmachete.com>
To: user@mahout.apache.org 
Sent: Thursday, May 31, 2012 6:42 PM
Subject: Re: Clustering a large crawl
 

Yeah, that's the conclusion I was coming to but thought I'd ask the experts. My dictionary
is pretty big. The last time I looked it was over 100,000 terms even with n-grams, Lucene stop
words, no numbers, and stemming. I've tried Tanimoto too with similar results.

Dimensionality reduction seems like the next thing to try.

-Pat


Further data from 150,000 docs. Using Canopy clustering I get these values:
    t1 = t2 = 0.3 => 123094 canopies 
    t1 = t2 = 0.6 => 97035 canopies 
    t1 = t2 = 0.9 => 60160 canopies 
    t1 = t2 = 0.91 => 59491 canopies 
    t1 = t2 = 0.93 => 58526 canopies 
    t1 = t2 = 0.95 => 57854 canopies 
    t1 = t2 = 0.97 => 57244 canopies 
    t1 = t2 = 0.99 => 56241 canopies 
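As an aside, a minimal single-machine sketch of the canopy pass (plain Java, not Mahout's
CanopyDriver) can help build intuition for how those counts react to t before launching the
full MapReduce run; the sample vectors and thresholds below are made up:

    import java.util.ArrayList;
    import java.util.List;

    /** Minimal in-memory canopy pass, only to see how the canopy count reacts to t. */
    public class CanopyPreview {

      /** Cosine distance = 1 - cosine similarity, matching the convention discussed below. */
      static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
          dot += a[i] * b[i];
          normA += a[i] * a[i];
          normB += b[i] * b[i];
        }
        double denominator = Math.sqrt(normA) * Math.sqrt(normB);
        return denominator == 0.0 ? 1.0 : 1.0 - dot / denominator;
      }

      /**
       * Counts canopies for a single threshold t. With t1 == t2 == t, as in the runs above,
       * only the removal threshold matters for the count: each point is taken out of the
       * candidate list by the first center it falls within t of.
       */
      static int countCanopies(List<double[]> points, double t) {
        List<double[]> candidates = new ArrayList<double[]>(points);
        int canopies = 0;
        while (!candidates.isEmpty()) {
          double[] center = candidates.remove(0);   // next candidate becomes a canopy center
          canopies++;
          List<double[]> stillCandidates = new ArrayList<double[]>();
          for (double[] p : candidates) {
            if (cosineDistance(center, p) >= t) {   // points within t are absorbed by this canopy
              stillCandidates.add(p);
            }
          }
          candidates = stillCandidates;
        }
        return canopies;
      }

      public static void main(String[] args) {
        // Made-up vectors; in practice you would sample a few thousand of the real tf-idf vectors.
        List<double[]> sample = new ArrayList<double[]>();
        sample.add(new double[] {1.0, 0.1, 0.0});
        sample.add(new double[] {0.9, 0.2, 0.0});
        sample.add(new double[] {0.0, 1.0, 0.8});
        sample.add(new double[] {0.1, 0.9, 1.0});
        for (double t : new double[] {0.3, 0.6, 0.9}) {
          System.out.println("t1 = t2 = " + t + " => " + countCanopies(sample, t) + " canopies");
        }
      }
    }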



On 5/31/12 2:31 PM, Jeff Eastman wrote:
> And I misconstrued your earlier remarks on cluster size vs number of clusters. As t -> 1
> you will get fewer and fewer canopies as you have observed. It actually doesn't seem like
> the cosine distance measure is working very well for you.
>
> Have you mentioned the size of your dictionary earlier? Perhaps increasing the number of
> stop words that are rejected will decrease the vector size and make clustering work better.
> This seems like the curse of dimensionality at work.
>
>On 5/31/12 11:18 AM, Pat Ferrel wrote: 
>
>>Oops, misspoke. 0 good, 1 bad for clustering at least.
>>For similarity, 1 good, 0 bad.
>>
>>One is a similarity value and the other a distance measure. 
>>
>>But the primary question is how to get better canopies. I would expect that as the
>>distance t gets small the number of canopies gets large, which is what I see in the
>>data below. Jeff suggests I try much smaller t to get fewer canopies and I will, though
>>I don't understand the logic. The docs are not all that similar, being from a general
>>news crawl.
>>
>>When using the CosineDistanceMeasure in Canopy on a corpus of 150,000 docs I get:
>>    t1 = t2 = 0.3 => 123094 canopies 
>>    t1 = t2 = 0.6 => 97035 canopies 
>>    t1 = t2 = 0.9 => 60160 canopies 
>>
>>Obviously none of these values for t is very useful and it looks like I need to make t
>>even larger, which would seem to indicate very loose/non-dense canopies, no? For very
>>large ts are the canopies useful?
>>
>>I'm trying both but the other odd thing is that it takes longer to run canopy on this
>>data than to run kmeans, a lot longer.
>>
>>On 5/31/12 12:44 AM, Sean Owen wrote: 
>>
>>>On Thu, May 31, 2012 at 12:36 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
>>>
>>>>I see
>>>>    double denominator = Math.sqrt(lengthSquaredp1) * Math.sqrt(lengthSquaredp2);
>>>>    // correct for floating-point rounding errors
>>>>    if (denominator < dotProduct) {
>>>>      denominator = dotProduct;
>>>>    }
>>>>    return 1.0 - dotProduct / denominator;
>>>>
>>>>So this is going to return 1 - cosine, right? So for clustering the distance
>>>>1 = very close, 0 = very far.
>>>>
>>>When two vectors are close, the angle between them is small, so the cosine
>>>is large, near 1. 0 = close, 1 = far, as expected.
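A quick numeric check of that convention, with made-up vectors and assuming Mahout's
CosineDistanceMeasure and DenseVector are on the classpath:

    import org.apache.mahout.common.distance.CosineDistanceMeasure;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class CosineConventionCheck {
      public static void main(String[] args) {
        CosineDistanceMeasure cosine = new CosineDistanceMeasure();

        Vector a = new DenseVector(new double[] {1.0, 0.9, 0.0});
        Vector nearA = new DenseVector(new double[] {0.9, 1.0, 0.0});      // nearly parallel
        Vector orthogonal = new DenseVector(new double[] {0.0, 0.0, 1.0}); // at right angles

        // Nearly parallel vectors: cosine ~ 1, so distance = 1 - cosine ~ 0.
        System.out.println("close pair:      " + cosine.distance(a, nearA));
        // Orthogonal vectors: cosine = 0, so distance = 1.
        System.out.println("orthogonal pair: " + cosine.distance(a, orthogonal));
      }
    }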