mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Judging the quality of clustering
Date Thu, 17 May 2012 19:53:28 GMT
I built a tool that iterates through a list of values for k on the same 
data and spits out the CDbw and ClusterEvaluator results each time.

When the evaluator or CDbw prunes a cluster, how do I interpret that? 
They seem to throw out the same clusters on a given run. Also, CDbw 
always returns an inter-cluster density of 0. Is that expected?

On 5/17/12 5:58 AM, Jeff Eastman wrote:
> Yes, that is the paper I used to implement CDbw. I've tried it a few 
> times along with the simpler ClusterEvaluator metrics I took from 
> Mahout In Action and they look to be reasonable - see the tests - 
> though I have no way to judge their absolute values. Anything you can 
> contribute in this area would be most welcome. Perhaps a wiki page?
>
>
> On 5/16/12 1:14 PM, Pat Ferrel wrote:
>> The reference was in the code for 
>> http://www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf
>>
>> On 5/16/12 9:56 AM, Pat Ferrel wrote:
>>> Thanks, I've been looking at that. Is there a description of how to 
>>> interpret those values? An academic paper maybe? The intra-cluster 
>>> distance intuitively seems to correspond to something like cohesion. 
>>> I don't get the intuition behind inter-cluster distances but Ted 
>>> thinks they are the most important.
>>>
>>> On 5/16/12 7:32 AM, Jeff Eastman wrote:
>>>> Mahout has a ClusterEvaluator and a CDbwEvaluator that compute some 
>>>> quality metrics (inter-cluster distance, intra-cluster distance, 
>>>> ...) that you may find useful. Both calculate a set of 
>>>> representative points from the clustering output and compute the 
>>>> O(n^2) metrics over these points rather than all of the points in 
>>>> each cluster.
>>>>
>>>> On 5/15/12 4:46 PM, Pat Ferrel wrote:
>>>>> So many questions (the best k, how to choose t1 and t2, how much 
>>>>> dimensionality reduction helps) would have clear answers if we had a 
>>>>> way to judge the quality of clusters.
>>>>>
>>>>> Various methods were discussed here for a time: 
>>>>> http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output
>>>>>
>>>>> Has there been any work on building a measure of quality?
>>>>>
>>>>>
>>>>
>>
>>
>
