mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: choosing appropriate t1,t2 for canopy clustering
Date Wed, 16 May 2012 14:34:47 GMT
You can use the RepresentativePointsDriver to pick a set of n 
representative points from each cluster to speed these calculations, but 
it requires the clusters and clustered points so it may not help with 
what you are doing.
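
For reference, the heuristic quoted below (T2 = mean pairwise distance, T1 = 2 * mean) could be sketched roughly as follows. This is a minimal illustration in plain Java, not Mahout code: the class and method names (`CanopyThresholds`, `meanPairwiseDistance`, `thresholds`) are hypothetical, vectors are assumed to be dense `double[]` arrays of equal length, and the distance is the Euclidean distance used in the quoted message.

```java
// Sketch only: plain Java with no Mahout classes; all names are hypothetical.
final class CanopyThresholds {

    /** Euclidean distance between two dense vectors of equal length. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Mean distance over all unordered pairs; needs n*(n-1)/2 distance evaluations. */
    static double meanPairwiseDistance(double[][] vectors) {
        int n = vectors.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two vectors");
        }
        double total = 0.0;
        long pairs = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                total += euclidean(vectors[i], vectors[j]);
                pairs++;
            }
        }
        return total / pairs;
    }

    /** Returns {t1, t2} with t2 = mean pairwise distance and t1 = 2 * mean. */
    static double[] thresholds(double[][] vectors) {
        double mean = meanPairwiseDistance(vectors);
        return new double[] {2.0 * mean, mean};
    }
}
```

Calling `CanopyThresholds.thresholds(vectors)` returns the pair `{t1, t2}`. The double loop is quadratic in the number of vectors, which is the cost concern raised in the thread.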

On 5/16/12 4:16 AM, Paritosh Ranjan wrote:
> "calculated the mean distance between all the pairs of vectors"
>
>
> This can be a very costly operation if the dataset is reasonably large.
>
> On 16-05-2012 13:34, ivan obeso wrote:
>> In my text clustering project I used the Euclidean distance as the
>> distance measure. I wrote a method which calculated the mean distance
>> between all the pairs of vectors (documents) and used this mean as T2,
>> and for T1 I used mean*2. This approach worked really well for me,
>> giving a reasonable number of clusters across various corpora.
>>
>> On Tue, May 15, 2012 at 10:45 AM, Robert Stewart <bstewart.ny@gmail.com> wrote:
>>
>>> I am trying to run canopy clustering on vectors extracted from a Lucene
>>> index. I want to use CosineDistanceMeasure. How do I know what values
>>> are appropriate for the t1 and t2 distance thresholds? I would assume
>>> that the cosine distance measure returns "distances" in the range 0.0
>>> to 1.0, but that seems not to be the case, so how do I find the
>>> potential distance range in order to pick t1 and t2 (other than by
>>> trial and error)?
>>>
>>> Thanks
>>> Bob
>
>
>
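
On the range question in the original post: if cosine distance is defined as 1 minus the cosine similarity (the usual definition; whether CosineDistanceMeasure matches it exactly is an assumption here and should be checked against its source), the possible values span 0.0 to 2.0 in general, and 0.0 to 1.0 only when all vector components are non-negative. The sketch below is plain Java with hypothetical names (`CosineRange`, `cosineDistance`) and assumes dense, non-zero vectors of equal length.

```java
// Sketch only: plain Java with no Mahout classes; all names are hypothetical.
final class CosineRange {

    /** 1 - cosine similarity; assumes neither vector is all zeros. */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }
}
```

Under this definition, vectors pointing the same way give a distance near 0, orthogonal vectors give 1, and vectors pointing in opposite directions give 2, so t1 and t2 would have to fall inside whichever of these ranges applies to the data.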

