mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: choosing appropriate t1,t2 for canopy clustering
Date Tue, 15 May 2012 15:16:47 GMT
Hi Bob,

Cosine distance will return distances in the range 0.0 to 1.0, as you
suggest. While there is no foolproof technique for choosing canopy T1
and T2 values, I recommend you begin by setting T1==T2 and doing a
binary search from some initial distance, perhaps 0.1. If you get too
few clusters, halve T1==T2 and try again. If you get too many, double
it and try again.
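
The search described above can be sketched in a few lines of plain
Python. This is only an illustration under stated assumptions, not
Mahout code: `canopy_centers` is a simplified single-pass canopy with
T1==T2, `tune_threshold` and the sample points are hypothetical, and
cosine distance is taken as 1 - cosine similarity (which stays in
0.0 to 1.0 for non-negative vectors such as term weights).

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; falls in 0.0..1.0 for non-negative vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def canopy_centers(points, t):
    """Simplified sequential canopy pass with T1 == T2 == t."""
    centers = []
    for p in points:
        # A point closer than t to an existing center is absorbed by it;
        # otherwise it starts a new canopy.
        if not any(cosine_distance(p, c) < t for c in centers):
            centers.append(p)
    return centers


def tune_threshold(points, target, start=0.1, max_iter=20):
    """Halve t when there are too few canopies, grow it when too many."""
    lo, hi = 0.0, None  # lo: last t giving too many, hi: last t giving too few
    t = start
    for _ in range(max_iter):
        k = len(canopy_centers(points, t))
        if k == target:
            return t, k
        if k < target:
            hi = t
            t = (lo + hi) / 2.0  # too few clusters: shrink the threshold
        else:
            lo = t
            # too many clusters: double until an upper bound is known,
            # then bisect between the two bounds
            t = t * 2.0 if hi is None else (lo + hi) / 2.0
    return t, len(canopy_centers(points, t))


# Hypothetical toy data: three groups of directions in the positive quadrant.
SAMPLE_POINTS = [
    (1.0, 0.0), (1.0, 0.05), (1.0, 0.1),
    (1.0, 1.0), (1.0, 1.1), (0.9, 1.0),
    (0.0, 1.0), (0.05, 1.0), (0.1, 1.0),
]

if __name__ == "__main__":
    print(tune_threshold(SAMPLE_POINTS, target=3, start=0.001))
```

On real data each probe is a full canopy run, so it is worth capping
the number of iterations and accepting a cluster count that is close
enough rather than exact.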

If you want to be more analytical, use the RandomSeedGenerator to sample
from your input vectors and compute a starting point from the distances
between the sampled vectors. You can also skip Canopy and use k-means
with -k specified to sample from your input data and produce k clusters.
That works pretty well with text and Cosine distance.
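
As a rough illustration of the sampling idea, the sketch below draws a
random sample and summarizes its pairwise cosine distances. It does not
use Mahout's RandomSeedGenerator; Python's `random.sample` stands in for
it, and `distance_summary` and its parameters are hypothetical names.
The median (or a lower quantile) of the sampled distances is one
plausible starting value for T2, not a guaranteed good one.

```python
import itertools
import math
import random
import statistics


def cosine_distance(a, b):
    """1 - cosine similarity; falls in 0.0..1.0 for non-negative vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def distance_summary(vectors, sample_size=100, seed=42):
    """Sample vectors and summarize their pairwise cosine distances.

    Needs at least two vectors. The pairwise step is quadratic in
    sample_size, so keep the sample small.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    sample = rng.sample(vectors, min(sample_size, len(vectors)))
    dists = [cosine_distance(a, b)
             for a, b in itertools.combinations(sample, 2)]
    return {
        "min": min(dists),
        "median": statistics.median(dists),
        "max": max(dists),
    }


if __name__ == "__main__":
    # Hypothetical toy vectors standing in for the Lucene term vectors.
    vectors = [
        (1.0, 0.0), (1.0, 0.05), (1.0, 0.1),
        (1.0, 1.0), (1.0, 1.1), (0.9, 1.0),
        (0.0, 1.0), (0.05, 1.0), (0.1, 1.0),
    ]
    print(distance_summary(vectors))
```

The summary also shows the range the distance measure actually produces
on the data, which is a quick check when the values look unexpected.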

Once you arrive at a "reasonable" number of clusters, you can mess with
T1 to include more points in the centroid calculations, but that will
not change the number of clusters.
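
To see why T1 leaves the cluster count alone, here is a minimal sketch
of a sequential canopy pass with separate thresholds. It is a simplified
stand-in for the real implementation, and `build_canopies` and the toy
points are hypothetical. Only the T2 test decides whether a point starts
a new canopy; T1 only decides which canopies a point contributes to.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; falls in 0.0..1.0 for non-negative vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def build_canopies(points, t1, t2):
    """Return a list of (center, members) pairs; assumes t1 >= t2."""
    canopies = []
    for p in points:
        strongly_bound = False
        for center, members in canopies:
            d = cosine_distance(p, center)
            if d < t1:
                members.append(p)      # loose membership: feeds the centroid
            if d < t2:
                strongly_bound = True  # tight membership: no new canopy
        if not strongly_bound:
            canopies.append((p, [p]))
    return canopies


# Hypothetical toy data: three groups of directions in the positive quadrant.
POINTS = [
    (1.0, 0.0), (1.0, 0.05), (1.0, 0.1),
    (1.0, 1.0), (1.0, 1.1), (0.9, 1.0),
    (0.0, 1.0), (0.05, 1.0), (0.1, 1.0),
]

if __name__ == "__main__":
    for t1 in (0.1, 0.3):
        result = build_canopies(POINTS, t1=t1, t2=0.1)
        sizes = [len(members) for _, members in result]
        print("T1 =", t1, "canopies:", len(result), "member counts:", sizes)
```

With T2 held at 0.1, raising T1 from 0.1 to 0.3 keeps the same number of
canopies in this toy data while each canopy collects more points.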


On 5/15/12 10:45 AM, Robert Stewart wrote:
> I am trying to run canopy clustering on vectors extracted from a
> Lucene index. I want to use CosineDistanceMeasure. How do I know what
> appropriate values to use for the t1 and t2 distance thresholds? I
> would assume that the Cosine distance measure would return "distances"
> in a range from 0.0 to 1.0, but that seems not to be the case, so how
> do I know what the potential distance ranges are to pick t1 and t2
> (other than much trial and error)?
>
> Thanks
> Bob
>

