mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: choosing appropriate t1,t2 for canopy clustering
Date Tue, 15 May 2012 19:28:12 GMT
I've seen the same thing. I don't think there is a way to specify
acceptable distances ahead of time, because the algorithm first finds the
centroids and can only calculate distances to them after they have
converged. That might be your answer, though, because these distances are
stored at the end of the job. In my case I only allow the docs closest to
the centroid into my UI. You could set your own threshold and discard docs
that are too far away. In some cases you may then get empty clusters.
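
A minimal sketch of that post-filtering idea, in plain Python rather than
Mahout code. The names cosine_distance and filter_by_distance are
hypothetical, and it assumes the converged centroids and document vectors
have already been read out of the job output as lists of numbers:

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity (in [0, 1] for non-negative vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat a zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def filter_by_distance(centroids, docs, max_distance):
    """Assign each doc to its nearest centroid, dropping docs that are
    farther than max_distance from it. Clusters may end up empty."""
    clusters = {i: [] for i in range(len(centroids))}
    for doc_id, vec in docs.items():
        dists = [cosine_distance(vec, c) for c in centroids]
        best = min(range(len(centroids)), key=dists.__getitem__)
        if dists[best] <= max_distance:
            clusters[best].append((doc_id, dists[best]))
    return clusters
```

With a tight max_distance a cluster can lose all of its docs, which is the
empty-cluster case mentioned above.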

On 5/15/12 8:36 AM, Robert Stewart wrote:
> Thanks Jeff.  I do see that cosine distance does return 0.0-1.0 now as expected.  Something
> else was wrong in my initial run I guess.
>
> A different question about k-means:  I can successfully cluster using k-means but what
> happens is some clusters are very unrelated, so it seems like there needs to be some distance
> threshold to cluster documents using k-means (so very dissimilar items just
> don't get put into any cluster).  Is that possible with Mahout?  I don't see any type of threshold
> parameters for k-means.
>
>
> On May 15, 2012, at 11:16 AM, Jeff Eastman wrote:
>
>> Hi Bob,
>>
>> Cosine distance will return distances on 0.0...1.0 as you suggest. While there is
>> no absolutely foolproof technique for priming canopy T1 & T2 values, I recommend you begin
>> by setting T1==T2 and doing a binary search from some initial distance, perhaps 0.1. If you
>> get too few clusters, decrease T1==T2 by half and try again. If too many, double it, etc.
>>
>> If you want to be more analytical, use the RandomSeedGenerator to sample from your
>> input vectors and compute a starting point using their inter-cluster distances. You can also
>> skip Canopy and use k-means with -k specified to sample from your input data and produce k
>> clusters. That works pretty well with text and Cosine distance.
>>
>> Once you arrive at a "reasonable" number of clusters, you can mess with T1 to include
>> more points in the centroid calculations but that will not change the number of clusters.
>>
>>
>> On 5/15/12 10:45 AM, Robert Stewart wrote:
>>> I am trying to run canopy clustering on vectors extracted from a Lucene index.
>>> I want to use CosineDistanceMeasure.  How do I know what appropriate values to use for the t1
>>> and t2 distance thresholds?  I would assume that the Cosine distance measure would return "distances"
>>> in a range from 0.0 to 1.0 but that seems not to be the case, so how do I know what the potential
>>> distance ranges are to pick t1 and t2 (other than by much trial and error)?
>>>
>>> Thanks
>>> Bob
>>>
>
>
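
Jeff's T1==T2 search quoted above can be sketched roughly as follows. This
is plain Python, not Mahout code: count_canopies is a simplified
single-threshold canopy pass, search_threshold and its min_k/max_k
parameters are hypothetical names, and switching to bisection once the
target is bracketed is an assumption added here so the search cannot
bounce between two values forever.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity (in [0, 1] for non-negative vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0.0 else 1.0 - dot / norm

def count_canopies(points, t, distance):
    """Simplified canopy pass with T1 == T2 == t: take the next remaining
    point as a center, then drop every point within t of it."""
    remaining = list(points)
    count = 0
    while remaining:
        center = remaining.pop(0)
        count += 1
        remaining = [p for p in remaining if distance(center, p) > t]
    return count

def search_threshold(points, distance, min_k, max_k, t=0.1, max_iter=30):
    """Halve t on too few canopies, double it on too many; once both sides
    are bracketed, bisect. Returns (t, k), or None if nothing was found."""
    lo = hi = None  # lo gave too many canopies, hi gave too few
    for _ in range(max_iter):
        k = count_canopies(points, t, distance)
        if min_k <= k <= max_k:
            return t, k
        if k < min_k:  # too few canopies: threshold is too large
            hi = t
            t = t / 2 if lo is None else (lo + hi) / 2
        else:          # too many canopies: threshold is too small
            lo = t
            t = t * 2 if hi is None else (lo + hi) / 2
    return None
```

The canopy count depends on the order of the input points, so a threshold
found on a sample is only a starting point for the real run.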
