# mahout-user mailing list archives

From Pat Ferrel <...@farfetchers.com>
Subject Re: Clustering a large crawl
Date Thu, 31 May 2012 15:18:45 GMT
```
Oops, misspoke. For clustering, at least, 0 is good and 1 is bad.
For similarity, 1 is good and 0 is bad.

One is a similarity value and the other a distance measure.

But the primary question is how to get better canopies. I would expect
that as the distance t gets small the number of canopies gets large
which is what I see in the data below. Jeff suggests I try much smaller
t to get fewer canopies, and I will, though I don't understand the logic.
The docs are not all that similar, being from a general news crawl.

When using the CosineDistanceMeasure in Canopy on a corpus of 150,000
docs I get:
t1 = t2 = 0.3 => 123094 canopies
t1 = t2 = 0.6 => 97035 canopies
t1 = t2 = 0.9 => 60160 canopies

Obviously none of these values for t is very useful and it looks like I
need to make t even larger, which would seem to indicate very
loose/non-dense canopies, no? For very large ts are the canopies useful?

I'm trying both but the other odd thing is that it takes longer to run
canopy on this data than to run kmeans, a lot longer.

On 5/31/12 12:44 AM, Sean Owen wrote:
> On Thu, May 31, 2012 at 12:36 AM, Pat Ferrel<pat@occamsmachete.com>  wrote:
>
>> I see
>>     double denominator = Math.sqrt(lengthSquaredp1) * Math.sqrt(lengthSquaredp2);
>>     // correct for floating-point rounding errors
>>     if (denominator < dotProduct) {
>>       denominator = dotProduct;
>>     }
>>     return 1.0 - dotProduct / denominator;
>>
>> So this is going to return 1 - cosine, right? So for clustering the
>> distance 1 = very close, 0 = very far.
>>
>>
> When two vectors are close, the angle between them is small, so the cosine
> is large, near 1. 0 = close, 1 = far, as expected.
>

```
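To make the two points in this thread concrete, here is a minimal Java sketch. It is not Mahout's implementation: `CosineCanopySketch`, `cosineDistance`, and `countCanopies` are hypothetical names, the zero-vector handling is an assumption, and the canopy pass is a simplified single-pass version with t1 = t2 = t. It shows that 1 - cosine gives 0 for vectors pointing the same way and 1 for orthogonal ones, and that a larger t should absorb more points into existing canopies and so leave fewer canopies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CosineCanopySketch {

    // Cosine distance as 1 - cosine similarity, following the quoted snippet:
    // 0 = same direction (close), 1 = orthogonal (far).
    static double cosineDistance(double[] a, double[] b) {
        double dotProduct = 0.0;
        double lengthSquaredA = 0.0;
        double lengthSquaredB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dotProduct += a[i] * b[i];
            lengthSquaredA += a[i] * a[i];
            lengthSquaredB += b[i] * b[i];
        }
        double denominator = Math.sqrt(lengthSquaredA) * Math.sqrt(lengthSquaredB);
        // Correct for floating-point rounding errors, as in the quoted code.
        if (denominator < dotProduct) {
            denominator = dotProduct;
        }
        // Assumption (not from the thread): treat a zero vector as maximally far.
        if (denominator == 0.0) {
            return 1.0;
        }
        return 1.0 - dotProduct / denominator;
    }

    // Simplified single-pass canopy count with t1 = t2 = t: a point closer
    // than t to an existing center is absorbed; otherwise it starts a new canopy.
    static int countCanopies(List<double[]> points, double t) {
        List<double[]> centers = new ArrayList<>();
        for (double[] p : points) {
            boolean covered = false;
            for (double[] c : centers) {
                if (cosineDistance(p, c) < t) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                centers.add(p);
            }
        }
        return centers.size();
    }

    public static void main(String[] args) {
        // Toy 2-D "documents"; real term vectors would be high-dimensional and sparse.
        List<double[]> points = Arrays.asList(
            new double[] {1.0, 0.0},
            new double[] {0.9, 0.1},
            new double[] {0.0, 1.0},
            new double[] {0.1, 0.9},
            new double[] {1.0, 1.0});
        for (double t : new double[] {0.001, 0.1, 0.5}) {
            System.out.println("t = " + t + " => " + countCanopies(points, t) + " canopies");
        }
    }
}
```

On this toy data the canopy count shrinks as t grows, which is the same direction as the 0.3 / 0.6 / 0.9 figures reported above. With non-negative term vectors the cosine distance stays within [0, 1], so in this sketch any t above 1 would collapse everything into a single canopy.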