Brown and DiPietro's algorithm for clustering based on entropy is
somewhat infamous for the difficulty of achieving usable performance.
Mike Collins was responsible for a famously speedy version. Having
build one that is just barely fast enough in C++, I wouldn't recommend
trying it in Java. Of course, you aren't proposing that, just
recommending the bigram entropy metric or something like it.
On Mon, Jul 27, 2009 at 9:42 PM, Ted Dunning<ted.dunning@gmail.com> wrote:
> (vastly delayed response ... huge distractions competing with more than 2
> minutes answers are to blame)
>
> Grant,
>
> For evaluating clustering for symbol sequences:
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.7275
>
> Most of the other references I have found talk about quality relative to
> gold standard judgments about whether exemplars are in the same class or
> relative to similarity/distinctiveness ratios. Neither is all that
> satisfactory.
>
> My preference is an entropic measure that describes how much of the
> information in your data is captured by the clustering vs how much residual
> info there is.
>
> The other reference I am looking for may be in David Mackay's book. The
> idea is that you measure the quality of the approximation by looking at the
> entropy in the cluster assignment relative to the residual required to
> precisely specify the original data relative to the quantized value.
>
> This is also related to trading off signal/noise in a vector quantizer.
>
> David, do you have a moment to talk about this with me? I can't free up
> the time to chase these final references and come up with a nice formula for
> this. I think you could do it in 1020 minutes.
>
> On Tue, Jul 14, 2009 at 6:41 AM, Grant Ingersoll <gsingers@apache.org>wrote:
>
>> On Jun 17, 2009, at 2:51 AM, Ted Dunning wrote:
>>
>> A principled approach to cluster evaluation is to measure how well the
>>> cluster membership captures the structure of unseen data. A natural
>>> measure
>>> for this is to measure how much of the entropy of the data is captured by
>>> cluster membership. For kmeans and its natural L_2 metric, the natural
>>> cluster quality metric is the squared distance from the nearest centroid
>>> adjusted by the log_2 of the number of clusters. This can be compared to
>>> the squared magnitude of the original data or the squared deviation from
>>> the
>>> centroid for all of the data. The idea is that you are changing the
>>> representation of the data by allocating some of the bits in your original
>>> representation to represent which cluster each point is in. If those bits
>>> aren't made up by the residue being small then your clustering is making a
>>> bad tradeoff.
>>>
>>> In the past, I have used other more heuristic measures as well. One of
>>> the
>>> key characteristics that I would like to see out of a clustering is a
>>> degree
>>> of stability. Thus, I look at the fractions of points that are assigned
>>> to
>>> each cluster or the distribution of distances from the cluster centroid.
>>> These values should be relatively stable when applied to heldout data.
>>>
>>> For text, you can actually compute perplexity which measures how well
>>> cluster membership predicts what words are used. This is nice because you
>>> don't have to worry about the entropy of real valued numbers.
>>>
>>
>> Do you have any references on any of the above approaches?
>>
>
>
>
> 
> Ted Dunning, CTO
> DeepDyve
>
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> http://www.deepdyve.com
> 8584140013 (m)
> 4087730220 (fax)
>
