mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Clustering a large crawl
Date Wed, 30 May 2012 23:36:26 GMT
I see
     double denominator = Math.sqrt(lengthSquaredp1)
         * Math.sqrt(lengthSquaredp2);
     // correct for floating-point rounding errors
     if (denominator < dotProduct) {
       denominator = dotProduct;
     }
     return 1.0 - dotProduct / denominator;

So this is going to return 1 - cosine, right? So for clustering the 
distance 1 = very close, 0 = very far.
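
For checking the direction of the scale, here is a minimal standalone sketch of the same computation on plain double[] arrays. It is not the Mahout class itself: the class name CosineDistanceSketch, the method name distance, and the zero-vector guard are assumptions made for illustration, and only the last three statements mirror the snippet quoted above.

```java
// Hypothetical standalone sketch; not org.apache.mahout.common.distance.CosineDistanceMeasure.
class CosineDistanceSketch {

    // Returns 1 - cos(angle between p1 and p2) for two equal-length dense vectors.
    static double distance(double[] p1, double[] p2) {
        double dotProduct = 0.0;
        double lengthSquaredp1 = 0.0;
        double lengthSquaredp2 = 0.0;
        for (int i = 0; i < p1.length; i++) {
            dotProduct += p1[i] * p2[i];
            lengthSquaredp1 += p1[i] * p1[i];
            lengthSquaredp2 += p2[i] * p2[i];
        }
        double denominator = Math.sqrt(lengthSquaredp1) * Math.sqrt(lengthSquaredp2);
        // correct for floating-point rounding errors
        if (denominator < dotProduct) {
            denominator = dotProduct;
        }
        // Assumption of this sketch: two zero vectors count as identical (avoids 0/0).
        if (denominator == 0.0) {
            return 0.0;
        }
        return 1.0 - dotProduct / denominator;
    }
}
```

In this sketch, distance(new double[] {1, 0}, new double[] {1, 0}) evaluates to 0.0 and distance(new double[] {1, 0}, new double[] {0, 1}) evaluates to 1.0, so smaller return values correspond to more similar vectors.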

When using the CosineDistanceMeasure in Canopy on a corpus of 150,000 
docs I get:
     t1 = t2 = 0.3 => 123094 canopies
     t1 = t2 = 0.6 => 97035 canopies
     t1 = t2 = 0.9 => 60160 canopies

The number of canopies seems to go down as t goes up so I assumed t was 
actually a cosine in this case. I'd expect to get 150,000 canopies with 
smaller t values. I have double checked and I did indeed use 
org.apache.mahout.common.distance.CosineDistanceMeasure.
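
To reason about how the canopy count should move with t when t1 = t2, a simplified single-pass sketch may help. It is not Mahout's canopy implementation: the class CanopyCountSketch and its methods are hypothetical, each canopy is represented only by the first point that created it, and a point starts a new canopy only when it lies at distance t or more from every existing canopy center.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplification of canopy generation with t1 = t2 = t; not Mahout's code.
class CanopyCountSketch {

    // 1 - cosine similarity; assumes equal-length, non-zero vectors.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, lenSqA = 0.0, lenSqB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            lenSqA += a[i] * a[i];
            lenSqB += b[i] * b[i];
        }
        double denominator = Math.sqrt(lenSqA) * Math.sqrt(lenSqB);
        if (denominator < dot) {
            denominator = dot; // same rounding guard as the snippet quoted above
        }
        return 1.0 - dot / denominator;
    }

    // Counts canopies: a point within distance t of an existing center is absorbed,
    // otherwise it becomes the center of a new canopy.
    static int countCanopies(double[][] points, double t) {
        List<double[]> centers = new ArrayList<>();
        for (double[] p : points) {
            boolean covered = false;
            for (double[] c : centers) {
                if (cosineDistance(p, c) < t) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                centers.add(p);
            }
        }
        return centers.size();
    }
}
```

On the four toy points {1, 0}, {1, 0.1}, {1, 1}, {1, 1.1}, this sketch gives 4 canopies at t = 0.0005, 2 at t = 0.1, and 1 at t = 0.5. Under these assumptions the count falls as t rises when t is treated as a distance, which matches the direction of the numbers above.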


On 5/30/12 1:26 PM, Jeff Eastman wrote:
> The CosineDistanceMeasure returns 1 - dotProduct / denominator so it 
> is returning the value you note. If the documents are very similar, 
> then their distance will be small and t=0.1 could be too large to 
> distinguish anything but the gross differences between the documents 
> in the corpus. I'd try dropping the t-value until I get at least 
> 50-100 clusters but I have no idea how small that might be.
>
>
> On 5/30/12 4:11 PM, Robert Stewart wrote:
>> That is a good point.   t1/t2 are distance measures but cosine is a 
>> similarity measure, so you need to think of it as 1-cosine.
>>
>>
>>
>> On May 30, 2012, at 4:03 PM, Jeff Eastman wrote:
>>
>>> Have you tried much smaller values for t1=t2? Recall that the 
>>> t-values specify the distance within which a new point is assigned 
>>> to an existing canopy. In the limit as t ->  0, you should get n 
>>> clusters, where n is the number of documents in your corpus.
>>>
>>> On 5/30/12 1:23 PM, Pat Ferrel wrote:
>>>> I have about 150,000 docs on which I ran canopy with values for t1 
>>>> = t2 from 0.1 to 0.95 using the Cosine distance measure. I got 
>>>> results that range from 1.5 docs per cluster to 3. In other words 
>>>> canopy produced a very large number of centroids, which does not 
>>>> seem to represent the data very well. Trying random values for k 
>>>> seems to produce better results but still spotty and hard to judge. 
>>>> I am at the point of giving up on canopy and so wrote a utility to 
>>>> simply iterate k over some values and run the evaluators each time, 
>>>> but there are currently some problems with CDbw (Inter-Cluster 
>>>> Density is always 0.0 for instance).
>>>>
>>>> This seems like such a fundamental problem that others must have 
>>>> found a way to get better results. Any suggestions?
>>>>
>>>>
>>
>>
>
