# mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Clustering a large crawl
Date Thu, 31 May 2012 22:42:00 GMT
```
Yeah, that's the conclusion I was coming to but thought I'd ask the
experts. My dictionary is pretty big. The last time I looked it was over
100,000 terms even with n-grams, lucene stop words, no numbers, and
stemming. I've tried Tanimoto too with similar results.

Dimensional reduction seems like the next thing to try.

-Pat

Further data from 150,000 docs. Using Canopy clustering I get these values:
t1 = t2 = 0.3 => 123094 canopies
t1 = t2 = 0.6 => 97035 canopies
t1 = t2 = 0.9 => 60160 canopies
t1 = t2 = 0.91 => 59491 canopies
t1 = t2 = 0.93 => 58526 canopies
t1 = t2 = 0.95 => 57854 canopies
t1 = t2 = 0.97 => 57244 canopies
t1 = t2 = 0.99 => 56241 canopies

On 5/31/12 2:31 PM, Jeff Eastman wrote:
> And I misconstrued your earlier remarks on cluster size vs number of
> clusters. As t -> 1 you will get fewer and fewer canopies as you have
> observed. It actually doesn't seem like the cosine distance measure is
> working very well for you.
>
> Have you mentioned the size of your dictionary earlier? Perhaps
> increasing the number of stop words that are rejected will decrease
> the vector size and make clustering work better. This seems like the
> curse of dimensionality at work.
>
> On 5/31/12 11:18 AM, Pat Ferrel wrote:
>> Oops, misspoke. 0 good, 1 bad for clustering at least
>> For similarity 1 good 0 bad.
>>
>> One is a similarity value and the other a distance measure.
>>
>> But the primary question is how to get better canopies. I would
>> expect that as the distance t gets small the number of canopies gets
>> large which is what I see in the data below. Jeff suggests I try much
>> smaller t to get fewer canopies and I will, though I don't understand
>> the logic. The docs are not all that similar, being from a general
>> news crawl.
>>
>> When using the CosineDistanceMeasure in Canopy on a corpus of 150,000
>> docs I get:
>>     t1 = t2 = 0.3 => 123094 canopies
>>     t1 = t2 = 0.6 => 97035 canopies
>>     t1 = t2 = 0.9 => 60160 canopies
>>
>> Obviously none of these values for t is very useful and it looks like
>> I need to make t even larger, which would seem to indicate very
>> loose/non-dense canopies, no? For very large ts are the canopies useful?
>>
>> I'm trying both but the other odd thing is that it takes longer to
>> run canopy on this data than to run kmeans, a lot longer.
>>
>> On 5/31/12 12:44 AM, Sean Owen wrote:
>>> On Thu, May 31, 2012 at 12:36 AM, Pat Ferrel<pat@occamsmachete.com>
>>> wrote:
>>>
>>>> I see
>>>>     double denominator = Math.sqrt(lengthSquaredp1) * Math.sqrt(lengthSquaredp2);
>>>>     // correct for floating-point rounding errors
>>>>     if (denominator < dotProduct) {
>>>>       denominator = dotProduct;
>>>>     }
>>>>     return 1.0 - dotProduct / denominator;
>>>>
>>>> So this is going to return 1 - cosine, right? So for clustering the
>>>> distance 1 = very close, 0 = very far.
>>>>
>>>>
>>> When two vectors are close, the angle between them is small, so the
>>> cosine is large, near 1, and the distance 1 - cosine is near 0.
>>> So 0 = close, 1 = far, as expected.
>>>
>>
>>
>

```
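For readers following the thread, here is a minimal Java sketch of the two ideas discussed: the `1 - cosine` distance from the quoted snippet, and why a larger threshold gives fewer canopies when t1 = t2. It is not Mahout's implementation. The class name `CanopySketch`, the method `countCanopies`, the dense `double[]` vectors, the zero-vector handling, and the single-pass greedy assignment are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; names and structure are hypothetical, not Mahout's API. */
final class CanopySketch {

    /** Cosine distance, 1 - cos(angle): 0 = same direction, 1 = orthogonal. */
    static double cosineDistance(double[] a, double[] b) {
        double dotProduct = 0.0;
        double lengthSquaredA = 0.0;
        double lengthSquaredB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dotProduct += a[i] * b[i];
            lengthSquaredA += a[i] * a[i];
            lengthSquaredB += b[i] * b[i];
        }
        double denominator = Math.sqrt(lengthSquaredA) * Math.sqrt(lengthSquaredB);
        // Correct for floating-point rounding errors, as in the quoted snippet.
        if (denominator < dotProduct) {
            denominator = dotProduct;
        }
        // Assumption of this sketch: a zero vector is treated as maximally far.
        if (denominator == 0.0) {
            return 1.0;
        }
        return 1.0 - dotProduct / denominator;
    }

    /**
     * Counts canopies with t1 = t2 = t using a simplified greedy pass:
     * a point closer than t to an existing center is absorbed,
     * otherwise it becomes the center of a new canopy.
     */
    static int countCanopies(List<double[]> points, double t) {
        List<double[]> centers = new ArrayList<>();
        for (double[] point : points) {
            boolean covered = false;
            for (double[] center : centers) {
                if (cosineDistance(point, center) < t) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                centers.add(point);
            }
        }
        return centers.size();
    }
}
```

Calling `countCanopies` on the same points with increasing `t` can only absorb more points into existing canopies, which matches the trend in the figures above: the count falls as t approaches 1. With very sparse, high-dimensional vectors most pairs are nearly orthogonal, so their distances sit close to 1 and the count stays high until t is very large.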