mahout-user mailing list archives

From Grant Ingersoll <>
Subject Re: Validating clustering output
Date Wed, 17 Jun 2009 13:32:57 GMT

On Jun 17, 2009, at 2:51 AM, Ted Dunning wrote:

> A principled approach to cluster evaluation is to measure how well the
> cluster membership captures the structure of unseen data.  A natural
> measure for this is how much of the entropy of the data is captured by
> cluster membership.  For k-means and its natural L_2 metric, the natural
> cluster quality metric is the squared distance from the nearest centroid
> adjusted by the log_2 of the number of clusters.  This can be compared to
> the squared magnitude of the original data or the squared deviation from
> the centroid for all of the data.  The idea is that you are changing the
> representation of the data by allocating some of the bits in your
> original representation to represent which cluster each point is in.  If
> those bits aren't made up for by the residue being small, then your
> clustering is making a bad trade-off.
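
One way the bit accounting above could be sketched, for concreteness. The fixed-variance Gaussian coding model, the `sigma` parameter, and every function name here are assumptions made for illustration; nothing in this thread specifies them. Under that model a squared residual d^2 costs d^2 / (2 * sigma^2 * ln 2) bits, and naming one of k clusters costs log_2(k) bits per point.

```python
import math


def squared_distance(a, b):
    """Squared L_2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def bits_saved_per_point(heldout, centroids, baseline_center, sigma=1.0):
    """Average bits saved per held-out point by coding it relative to its
    nearest centroid instead of a single global center.

    Assumption: residuals are coded with a Gaussian of fixed standard
    deviation sigma, so d^2 costs d^2 / (2 * sigma^2 * ln 2) bits.
    The clustered code also pays log2(k) bits to name the cluster.
    A negative result means the clustering is a bad trade-off.
    """
    scale = 1.0 / (2.0 * sigma ** 2 * math.log(2))
    k = len(centroids)
    clustered_bits = 0.0
    baseline_bits = 0.0
    for point in heldout:
        nearest = min(squared_distance(point, c) for c in centroids)
        clustered_bits += nearest * scale
        baseline_bits += squared_distance(point, baseline_center) * scale
    n = len(heldout)
    return baseline_bits / n - (clustered_bits / n + math.log2(k))


# Toy data: two well-separated groups, centroids fitted on the training set.
train = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids = [[0.0, 0.5], [10.0, 10.5]]
heldout = [[0.2, 0.4], [9.8, 10.6]]
saved = bits_saved_per_point(heldout, centroids, mean_point(train))
```

The result depends heavily on the choice of `sigma`, which stands in for the precision at which the real-valued data is recorded, so the number is only meaningful when comparing clusterings under the same setting.
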
> In the past, I have used other more heuristic measures as well.  One of
> the key characteristics that I would like to see out of a clustering is a
> degree of stability.  Thus, I look at the fractions of points that are
> assigned to each cluster or the distribution of distances from the
> cluster centroid.  These values should be relatively stable when applied
> to held-out data.
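
A rough sketch of the first stability check, comparing the fraction of points assigned to each cluster on training data against held-out data. Using total variation distance to summarize the difference is an assumption for illustration, as are the function names; the distribution of distances to the centroid is not covered here.

```python
def squared_distance(a, b):
    """Squared L_2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def assignment_fractions(points, centroids):
    """Fraction of points assigned to each centroid, in centroid order."""
    counts = [0] * len(centroids)
    for point in points:
        counts[nearest_centroid(point, centroids)] += 1
    return [count / len(points) for count in counts]


def total_variation(p, q):
    """Total variation distance between two discrete distributions:
    0 means identical, 1 means completely disjoint."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Toy data: the held-out set leans more heavily toward the first cluster.
centroids = [[0.0, 0.0], [10.0, 10.0]]
train = [[0.0, 1.0], [1.0, 0.0], [9.0, 10.0], [10.0, 9.0]]
heldout = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0], [10.0, 10.0]]
drift = total_variation(assignment_fractions(train, centroids),
                        assignment_fractions(heldout, centroids))
```

A value of `drift` near zero suggests the cluster sizes carry over to unseen data; how large a value counts as unstable would have to be judged per data set.
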
> For text, you can actually compute perplexity, which measures how well
> cluster membership predicts what words are used.  This is nice because
> you don't have to worry about the entropy of real-valued numbers.
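
For the text case, a minimal sketch of held-out perplexity given cluster labels. The per-cluster unigram model, the add-alpha smoothing, and the single shared slot for unseen words are all assumptions chosen for illustration; other language models would work equally well.

```python
import math
from collections import Counter


def train_cluster_counts(docs, labels):
    """Word counts per cluster plus the overall training vocabulary.
    Each document is a list of tokens; labels[i] is the cluster of docs[i]."""
    counts = {}
    vocab = set()
    for doc, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(doc)
        vocab.update(doc)
    return counts, vocab


def perplexity(docs, labels, counts, vocab, alpha=1.0):
    """Per-word perplexity of docs under a smoothed unigram model chosen
    by each document's cluster label. Lower means cluster membership
    predicts word usage better.

    Assumption: add-alpha smoothing over the training vocabulary plus one
    extra slot shared by all words never seen in training.
    """
    slots = len(vocab) + 1
    total_log2 = 0.0
    n_words = 0
    for doc, label in zip(docs, labels):
        cluster_counts = counts.get(label, Counter())
        denom = sum(cluster_counts.values()) + alpha * slots
        for word in doc:
            total_log2 += math.log2((cluster_counts[word] + alpha) / denom)
            n_words += 1
    return 2.0 ** (-total_log2 / n_words)


# Toy corpus with two topical clusters.
train_docs = [["cat", "dog", "cat"], ["dog", "cat"],
              ["stock", "bond"], ["bond", "stock", "stock"]]
train_labels = [0, 0, 1, 1]
heldout_docs = [["cat", "dog"], ["stock", "bond"]]

counts, vocab = train_cluster_counts(train_docs, train_labels)
clustered = perplexity(heldout_docs, [0, 1], counts, vocab)

# Baseline: everything in one cluster, so membership carries no information.
one_counts, one_vocab = train_cluster_counts(train_docs, [0, 0, 0, 0])
unclustered = perplexity(heldout_docs, [0, 0], one_counts, one_vocab)
```

Comparing `clustered` against `unclustered` on held-out documents shows whether knowing the cluster actually helps predict the words.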

OK, so how do we go about codifying this stuff?  Is there existing  
code that we could use or is it worth us writing our own?

Some references would be good here, too.  Feel free to add to

.  (I've already linked this conversation, but will probably cut and  
paste some of it too.)

> Manual inspection and the so-called laugh test are also important.  The
> idea is that the results should not be so ludicrous as to make you laugh.
> Unfortunately, it is pretty easy to kid yourself into thinking your
> system is working using this kind of inspection.  The problem is that we
> are too good at seeing (making up) patterns.

I think this is where the new Open Relevance Project can come in,  
too.  Judgments, etc. ain't just for search!

> On Tue, Jun 16, 2009 at 2:35 PM, Grant Ingersoll <> wrote:
>> What tools/approaches are people using to validate their clustering
>> output?  Are there utilities that we should be implementing that would
>> make this easier for users?

Grant Ingersoll

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
