Ted,
On Jun 17, 2009, at 2:51 AM, Ted Dunning wrote:
> A principled approach to cluster evaluation is to measure how well the
> cluster membership captures the structure of unseen data. A natural
> measure for this is how much of the entropy of the data is captured by
> cluster membership. For k-means and its natural L_2 metric, the natural
> cluster quality metric is the squared distance from the nearest centroid,
> adjusted by the log_2 of the number of clusters. This can be compared to
> the squared magnitude of the original data, or to the squared deviation
> from the centroid for all of the data. The idea is that you are changing
> the representation of the data by allocating some of the bits in your
> original representation to represent which cluster each point is in. If
> those bits aren't made up for by the residual being small, then your
> clustering is making a bad tradeoff.
>
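If I follow the bits-tradeoff idea, it could be sketched like this (the function name and the equal weighting of residual versus index bits are my assumptions, not something Ted specified):

```python
import numpy as np

def description_length_bits(data, centroids):
    """MDL-style cluster quality: bits to name each point's cluster
    (log_2 k per point) plus the total squared residual to the nearest
    centroid. The relative scaling of the two terms is a modeling
    choice left open in the original discussion."""
    k = len(centroids)
    # Squared distance from every point to every centroid, shape (n, k).
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    residual = d2.min(axis=1).sum()       # total squared residual
    index_bits = len(data) * np.log2(k)   # bits spent naming clusters
    return residual + index_bits

# Baseline for comparison: squared deviation from the single global
# centroid, i.e. no clustering, so no index bits are spent.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.1, (50, 2)),
                  rng.normal(3, 0.1, (50, 2))])
baseline = ((data - data.mean(axis=0)) ** 2).sum()
centroids = np.array([[0.0, 0.0], [3.0, 3.0]])
clustered = description_length_bits(data, centroids)
# A good clustering should come out cheaper than the baseline.
```

On this toy two-blob data, the residual shrinks far more than the 1 bit per point spent naming the cluster, so the clustered cost beats the baseline.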
> In the past, I have used other, more heuristic measures as well. One of
> the key characteristics that I would like to see out of a clustering is
> a degree of stability. Thus, I look at the fractions of points that are
> assigned to each cluster, or the distribution of distances from the
> cluster centroid. These values should be relatively stable when applied
> to held-out data.
>
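The cluster-fraction version of that stability check might look like this (a sketch; the drift threshold you'd accept is up to you):

```python
import numpy as np

def cluster_fractions(data, centroids):
    """Fraction of points assigned to each centroid under the
    nearest-centroid rule."""
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return np.bincount(labels, minlength=len(centroids)) / len(data)

# Two draws from the same source stand in for training and held-out data.
rng = np.random.default_rng(1)
def sample():
    return np.vstack([rng.normal(0, 0.2, (200, 2)),
                      rng.normal(4, 0.2, (200, 2))])

train, heldout = sample(), sample()
centroids = np.array([[0.0, 0.0], [4.0, 4.0]])
f_train = cluster_fractions(train, centroids)
f_held = cluster_fractions(heldout, centroids)
# A stable clustering keeps the assignment fractions close.
drift = np.abs(f_train - f_held).max()
```

The same comparison works for the distance distributions, e.g. comparing per-cluster distance quantiles between the two samples.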
> For text, you can actually compute perplexity, which measures how well
> cluster membership predicts what words are used. This is nice because
> you don't have to worry about the entropy of real-valued numbers.
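A per-cluster unigram model is one simple way to compute that perplexity (a sketch; the add-one smoothing and the data layout are my assumptions):

```python
import math
from collections import Counter

def cluster_unigram_perplexity(clustered_docs, heldout):
    """Perplexity of held-out text under per-cluster unigram models.
    `clustered_docs` maps cluster id -> list of token lists (the docs
    assigned to that cluster); `heldout` maps cluster id -> held-out
    token list. Add-one smoothing keeps unseen words finite."""
    vocab = {w for docs in clustered_docs.values() for d in docs for w in d}
    log2_sum, n_tokens = 0.0, 0
    for cid, tokens in heldout.items():
        counts = Counter(w for d in clustered_docs[cid] for w in d)
        total = sum(counts.values()) + len(vocab)  # add-one smoothing
        for w in tokens:
            p = (counts.get(w, 0) + 1) / total
            log2_sum += -math.log2(p)
            n_tokens += 1
    # Lower perplexity means membership predicts the words better.
    return 2 ** (log2_sum / n_tokens)
```

Comparing this against a single global unigram model (all docs lumped into one cluster) shows whether membership actually buys any predictive power.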
Do you have any references on any of the above approaches?
Thanks,
Grant
