Hi Chris,
I'm running similar cluster quality tests and wrote code that gets
some of the statistics you want.
Have a look at the summarizeClusterDistances() method in [1]. You
give it your centroids and data points, and it returns a list of
OnlineSummarizers with the relevant distance statistics (your classes
are probably not the same, but this is a good starting point).
I computed the distances from each point to its cluster's center and
summarized the distances in an OnlineSummarizer.
You then get, for each cluster (similar to what you asked for):
1. the number of points in the cluster
2. the average distance to the center (getMean())
3. the first quartile (getQuartile(1))
4. the third quartile (getQuartile(3))
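
To make the statistics above concrete, here is a minimal, self-contained
sketch in plain Java. It is not the code in [1] and does not use Mahout:
the class name ClusterDistanceStats, the Summary type, and the helper
methods are made up for illustration. It assumes Euclidean distance,
assigns each point to its nearest centroid, and computes exact quartiles
by sorting (a streaming summarizer would typically approximate them).

```java
import java.util.ArrayList;
import java.util.List;

public class ClusterDistanceStats {

    /** Distance-to-centroid summary for a single cluster (hypothetical type). */
    public static final class Summary {
        public final int count;
        public final double mean, firstQuartile, thirdQuartile, max;

        Summary(double[] sorted) {
            count = sorted.length;
            double sum = 0;
            for (double d : sorted) sum += d;
            // Empty clusters get NaN statistics rather than misleading zeros.
            mean = count == 0 ? Double.NaN : sum / count;
            firstQuartile = quantile(sorted, 0.25);
            thirdQuartile = quantile(sorted, 0.75);
            max = count == 0 ? Double.NaN : sorted[count - 1];
        }
    }

    /** Euclidean distance; swap in whatever measure the clustering used. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Quantile of an ascending-sorted array, using linear interpolation. */
    static double quantile(double[] sorted, double q) {
        if (sorted.length == 0) return Double.NaN;
        double pos = q * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = Math.min(lo + 1, sorted.length - 1);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    /** Returns one Summary per centroid; assumes at least one centroid. */
    public static List<Summary> summarize(double[][] centroids, double[][] points) {
        List<List<Double>> distances = new ArrayList<>();
        for (int c = 0; c < centroids.length; c++) distances.add(new ArrayList<>());

        // Assign each point to its nearest centroid and record that distance.
        for (double[] p : points) {
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < centroids.length; c++) {
                double d = euclidean(p, centroids[c]);
                if (d < bestDist) { bestDist = d; best = c; }
            }
            distances.get(best).add(bestDist);
        }

        List<Summary> result = new ArrayList<>();
        for (List<Double> ds : distances) {
            double[] sorted = ds.stream().mapToDouble(Double::doubleValue).sorted().toArray();
            result.add(new Summary(sorted));
        }
        return result;
    }
}
```

A cluster whose mean, first quartile and third quartile are close together
is fairly tight; a max far above the third quartile hints at outliers.
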
I realize that you're talking about other kinds of distances (between
any two points), but information about the distances to the center
also gives you information about the quality of the clustering: an
upper bound for the distance between any 2 points in a cluster is
2 * the max distance to the center, the worst case being 2 points at
opposite ends of a diameter.
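
The bound follows from the triangle inequality, assuming the distance d
is a true metric (Euclidean distance is; squared Euclidean and cosine
distance are not, so the factor of 2 is not guaranteed for them). The
symbols x, y, c and C are notation introduced here: x and y are any two
points in cluster C, and c is its center.

```latex
d(x, y) \le d(x, c) + d(c, y) \le 2 \max_{p \in C} d(p, c)
```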
And here [2] is code that I use to make a CSV of these cluster statistics.
[1] https://github.com/dfilimon/mahout/blob/skm/examples/src/main/java/org/apache/mahout/clustering/streaming/utils/ExperimentUtils.java
[2] https://github.com/dfilimon/mahout/blob/skm/examples/src/main/java/org/apache/mahout/clustering/streaming/tools/ClusterQuality20NewsGroups.java#L97
On Tue, Feb 26, 2013 at 5:02 PM, Chris Harrington wrote:
> Well, what I'm trying to do is create clusters of topically similar content via kmeans.
>
> Since I'm basing validity on topics there's a manual judgement step.
> And that manual step is taking a prohibitive amount of time to check many clustering runs, hence the desire for some stats to indicate roughly how good the clusters are.
>
> So I want some stats from which, at a glance, I'll be able to tell which clusters "should" be good, and manually check those instead of having to check each and every one.
>
> I was thinking that a file with
>
> 1. the number of clusters,
> 2. the avg distance from every point to every other point
> 3. the avg distance from the points furthest from the center to all other points (furthest 25% of all points within a cluster)
> 4. the avg distance from the points closest to the center to all other points (closest 25% of all points within a cluster)
>
> would allow me to quickly see if I should even bother manually checking the clustering output, the logic being that if 4, 3 and 2 are similar in value then it's probably a decent cluster and I can manually check it. Also, a comparison of 3 vs 2 would indicate whether the cluster contains a number of distant outliers, and 4 vs 2 would show roughly how dense a cluster is.
>
> This makes sense, right? Or am I barking up the wrong tree?
>
> On 25 Feb 2013, at 20:15, Ted Dunning wrote:
>
>> The best way to evaluate a cluster really depends on what your purpose is.
>>
>> My own purpose is typically to use the clustering as a description of the
>> probability distribution of data.
>>
>> For that purpose, the best evaluation is distance to centroids for held-out
>> data. The use of held-out data is critical here since otherwise you could
>> just put a single cluster at every data point and get zero distance for the
>> original data. For held-out data, of course, the story would be different.
>>
>> This view of things is very good from the standpoint of machine learning
>> and data compression, but might be less useful for certain purposes that
>> have to do with explanation of data in human readable form. My experience
>> is that it is common for a clustering algorithm to be very good as a
>> probability distribution description but quite bad for human inspection.
>>
>> My own tendency would be to adapt the outline you gave to work on held-out
>> data instead of the original training data.
>>
>> On Mon, Feb 25, 2013 at 4:27 AM, Chris Harrington wrote:
>>
>>> Hi all,
>>>
>>> I want to find all the vectors within a cluster and then find the distance
>>> between them and every other vector within a cluster, in hopes this will
>>> give me a good idea of how similar each vector within a cluster is as well
>>> as identify outlier vectors.
>>>
>>> So there are 2 things I want to ask.
>>>
>>> 1. Is this a sensible approach to evaluating the cluster quality?
>>>
>>> 2. Is the correct file to get this info from the
>>> clusteredPoints/parts-m-00000 file?
>>>
>>> Thanks,
>>> Chris
>>>