mahout-user mailing list archives

From Pat Ferrel
Subject Re: Judging the quality of clustering
Date Fri, 18 May 2012 16:26:56 GMT
Thanks Jeff. In my experiment I ran k-means three times, with k = 
10, 20, 30. The number of documents was around 3000 (guessing here).

The k=10 run did not prune, k=30 pruned 4 clusters. I'll run this again 
to see if it is repeatable and you are welcome to the dataset.

I read that comment but was confused about the representative points. 
They appear to be collected by the RepresentativePointsDriver. The only 
input that looks relevant is an iteration number. I'll try increasing 
that to see whether the points are chosen better. As I understand it, 
pruned clusters are left out of the analysis, so I should do something 
to remedy the pruning.
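
To make the pruning rule concrete, here is a minimal sketch of the
validity check as it is described in the invalidCluster comment quoted
below. This is Python for illustration, not Mahout's Java code; the
names is_valid_cluster and prune, and the use of plain tuples for
points, are assumptions of mine.

```python
def is_valid_cluster(center, representative_points):
    """Sketch of the rule in the quoted comment, not Mahout's code.

    A cluster is kept only if it has more than 2 representative points
    and at least one of them differs from the cluster center.
    """
    if len(representative_points) <= 2:
        return False
    return any(tuple(p) != tuple(center) for p in representative_points)


def prune(clusters):
    """Keep only valid clusters; `clusters` maps id -> (center, points)."""
    return {cid: cp for cid, cp in clusters.items() if is_valid_cluster(*cp)}
```

Under this reading, a cluster whose representative points are all
copies of its center would be pruned, which is what the comment says
happens when the cluster is empty.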

I'd really like to get this working so if you have any suggestions for 
what to look at I'll give it a try. I have a tiny data set (16 small 
docs) I could use where you could probably calculate the CDbw by hand. k 
= 1, 2 maybe.
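
For a data set that small, a hand check could look something like the
sketch below. It is Python, and it computes simplified stand-ins (mean
pairwise Euclidean distances) rather than the actual ClusterEvaluator
or CDbw formulas. The function names and the dict layout are
assumptions for illustration only.

```python
from itertools import combinations
from math import dist


def mean_pairwise_distance(points):
    """Average Euclidean distance over all unordered pairs of points."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def intra_cluster_distance(clusters):
    """Mean, over clusters, of the mean pairwise distance among each
    cluster's representative points. `clusters` maps id -> points."""
    per_cluster = [mean_pairwise_distance(pts) for pts in clusters.values()]
    return sum(per_cluster) / len(per_cluster)


def inter_cluster_distance(centers):
    """Mean pairwise distance between cluster centers (id -> center)."""
    return mean_pairwise_distance(list(centers.values()))


# Tiny k = 2 example that is easy to verify by hand.
centers = {0: (0.0, 0.0), 1: (3.0, 4.0)}
clusters = {0: [(0.0, 0.0), (0.0, 2.0)], 1: [(3.0, 4.0), (3.0, 8.0)]}
print(inter_cluster_distance(centers))   # 5.0
print(intra_cluster_distance(clusters))  # 3.0
```

With k = 1 there are no center pairs, so the inter-cluster value in this
sketch falls back to 0.0.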

I'll poke around and see what I can find.

On 5/17/12 2:33 PM, Jeff Eastman wrote:
> Hi Pat,
> I don't have a good answer here. Evidently, something in CDbw has 
> become broken and you are the first to notice. When I run 
> TestCDbwEvaluator, the values for k-means and fuzzy-k are clearly 
> incorrect. The values for Canopy, MeanShift and Dirichlet are not so 
> obviously incorrect but I remain suspicious. Something must have 
> become broken in the recent clustering refactoring.
> From the method CDbwEvaluator.invalidCluster comment (used to enable 
> pruning):
>    * Return if the cluster is valid. Valid clusters must have more 
> than 2 representative points,
>    * and at least one of them must be different than the cluster 
> center. This is because the
>    * representative points extraction will duplicate the cluster 
> center if it is empty.
> Oddly enough, inspection of the test log indicates that only k-means 
> and fuzzy-k are not pruning clusters. Clearly some more investigation 
> is needed. I will take a look at it tomorrow. In the meantime, if you 
> develop any additional insight please do share it with us.
> Thanks,
> Jeff
> On 5/17/12 3:53 PM, Pat Ferrel wrote:
>> I built a tool that iterates through a list of values for k on the 
>> same data and spits out the CDbw and ClusterEvaluator results each time.
>> When the evaluator or CDbw prunes a cluster, how do I interpret that? 
>> They seem to throw out the same clusters on a given run. Also CDbw 
>> always returns an inter-cluster density of 0?
>> On 5/17/12 5:58 AM, Jeff Eastman wrote:
>>> Yes, that is the paper I used to implement CDbw. I've tried it a few 
>>> times along with the simpler ClusterEvaluator metrics I took from 
>>> Mahout In Action and they look to be reasonable - see the tests - 
>>> though I have no way to judge their absolute values. Anything you 
>>> can contribute in this area would be most welcome. Perhaps a wiki page?
>>> On 5/16/12 1:14 PM, Pat Ferrel wrote:
>>>> The reference was in the code for 
>>>> On 5/16/12 9:56 AM, Pat Ferrel wrote:
>>>>> Thanks, I've been looking at that. Is there a description of how 
>>>>> to interpret those values? An academic paper maybe? The 
>>>>> intra-cluster distance intuitively seems to correspond to 
>>>>> something like cohesion. I don't get the intuition behind 
>>>>> inter-cluster distances but Ted thinks they are the most important.
>>>>> On 5/16/12 7:32 AM, Jeff Eastman wrote:
>>>>>> Mahout has a ClusterEvaluator and a CDbwEvaluator that compute 
>>>>>> some quality metrics (inter-cluster distance, 
>>>>>> intra-cluster-distance, ...) that you may find useful. Both 
>>>>>> calculate a set of representative points from the clustering 
>>>>>> output and compute the O(n^2) metrics over these points rather 
>>>>>> than all of the points in each cluster.
>>>>>> On 5/15/12 4:46 PM, Pat Ferrel wrote:
>>>>>>> So many questions about best k, how to choose t1 and t2, and how
>>>>>>> much dimensional reduction helps would have clear answers if
>>>>>>> we had a way to judge the quality of clusters.
>>>>>>> Various methods were discussed here for a time: 
>>>>>>> Has there been any work on building a measure of quality?
