mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Validating clustering output
Date Tue, 14 Jul 2009 13:41:09 GMT
Ted,

On Jun 17, 2009, at 2:51 AM, Ted Dunning wrote:

> A principled approach to cluster evaluation is to measure how well the
> cluster membership captures the structure of unseen data.  A natural
> measure for this is how much of the entropy of the data is captured by
> cluster membership.  For k-means and its natural L_2 metric, the natural
> cluster quality metric is the squared distance from the nearest centroid
> adjusted by the log_2 of the number of clusters.  This can be compared to
> the squared magnitude of the original data or the squared deviation from
> the centroid for all of the data.  The idea is that you are changing the
> representation of the data by allocating some of the bits in your original
> representation to represent which cluster each point is in.  If those bits
> aren't made up by the residue being small, then your clustering is making
> a bad trade-off.
>
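
As a rough illustration of the bit-accounting argument above, here is a
sketch in plain Java, using bare double[] vectors and hypothetical helper
names (not Mahout's Vector API): the clustering "pays" log_2(k) bits per
point for membership, and that cost has to be recovered by a smaller
residual than the un-clustered baseline.

  // Residual cost of the clustering: squared distance to the nearest
  // centroid, plus log_2(k) membership bits per point.
  static double clusteringCost(double[][] points, double[][] centroids) {
    double cost = 0;
    for (double[] p : points) {
      double best = Double.POSITIVE_INFINITY;
      for (double[] c : centroids) {
        best = Math.min(best, squaredDistance(p, c));
      }
      cost += best;
    }
    return cost + points.length * (Math.log(centroids.length) / Math.log(2));
  }

  // Baseline: squared deviation from the single global centroid.
  static double baselineCost(double[][] points) {
    double[] mean = new double[points[0].length];
    for (double[] p : points)
      for (int i = 0; i < mean.length; i++) mean[i] += p[i] / points.length;
    double cost = 0;
    for (double[] p : points) cost += squaredDistance(p, mean);
    return cost;
  }

  static double squaredDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

If clusteringCost on held-out points is not clearly below baselineCost, the
membership bits aren't paying for themselves.
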
> In the past, I have used other more heuristic measures as well.  One of
> the key characteristics that I would like to see out of a clustering is a
> degree of stability.  Thus, I look at the fractions of points that are
> assigned to each cluster or the distribution of distances from the cluster
> centroid.  These values should be relatively stable when applied to
> held-out data.
>
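
A minimal sketch of that stability check, under the same assumptions as
above (bare double[] vectors, hypothetical helper names): assign a sample
of points to fixed centroids and record the fraction landing in each
cluster.

  // Fraction of points assigned to each cluster; comparing the result on
  // the training sample against held-out data gives a crude stability
  // signal (e.g. via the largest per-cluster absolute difference).
  static double[] clusterFractions(double[][] points, double[][] centroids) {
    double[] fractions = new double[centroids.length];
    for (double[] p : points) {
      int best = 0;
      double bestDist = Double.POSITIVE_INFINITY;
      for (int c = 0; c < centroids.length; c++) {
        double dist = 0;
        for (int i = 0; i < p.length; i++) {
          double d = p[i] - centroids[c][i];
          dist += d * d;
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
      }
      fractions[best] += 1.0 / points.length;
    }
    return fractions;
  }

The same loop could collect the per-cluster distance values instead, to
compare their distributions across the two samples.
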
> For text, you can actually compute perplexity, which measures how well
> cluster membership predicts what words are used.  This is nice because
> you don't have to worry about the entropy of real-valued numbers.
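
A sketch of that perplexity measure, assuming documents are represented as
int[] arrays of word ids over a vocabulary of size vocabSize (again a
hypothetical representation, not a Mahout API): estimate a smoothed unigram
model per cluster from one set of documents, then score held-out documents
under the model of their assigned cluster.

  // Cluster-conditional unigram perplexity: lower means cluster membership
  // predicts the words better.
  static double perplexity(int[][] trainDocs, int[] trainCluster,
                           int[][] heldOutDocs, int[] heldOutCluster,
                           int k, int vocabSize) {
    double[][] counts = new double[k][vocabSize];
    double[] totals = new double[k];
    for (int d = 0; d < trainDocs.length; d++) {
      for (int w : trainDocs[d]) {
        counts[trainCluster[d]][w]++;
        totals[trainCluster[d]]++;
      }
    }
    double log2Prob = 0;
    long words = 0;
    for (int d = 0; d < heldOutDocs.length; d++) {
      int c = heldOutCluster[d];
      for (int w : heldOutDocs[d]) {
        // add-one smoothing so unseen words get nonzero probability
        double p = (counts[c][w] + 1.0) / (totals[c] + vocabSize);
        log2Prob += Math.log(p) / Math.log(2);
        words++;
      }
    }
    return Math.pow(2.0, -log2Prob / words);
  }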

Do you have any references on any of the above approaches?

Thanks,
Grant
