mahout-user mailing list archives

From Ted Dunning <>
Subject Re: Plotting cluster quality
Date Fri, 22 Feb 2013 17:47:13 GMT
What does color mean here?

What about width of the box?

When you say median or mean of all cluster distances, do you mean across
that single run?

I think that this plot is fine as it is except that it needs a legend that
explains all of these issues.  My general rule of thumb is that most
figures should have what I call a "Kipling caption".  See the caption of
the first image here: to see what
I mean by this.  Imagine that a very mathematically inclined four-year-old
is looking at your diagram and quizzing you about every part.  Answer all of
their questions in the caption and you have a Kipling caption.

For comparing different runs of the clustering or different algorithms, I
think that a cumulative distribution plot (using plot.ecdf) with all of the
different algorithms on one plot would be the best comparison tool.
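
As a rough illustration of that comparison, here is a minimal sketch in
Python rather than R (plot.ecdf itself is an R function, so this is a
stand-in, not the same call).  The function names, the axis labels, and the
sample numbers are all hypothetical; the real distances would come from the
quality CSV, whose layout is not shown here.  The plotting helper assumes
matplotlib is installed.

```python
from bisect import bisect_right


def ecdf(values):
    """Return the sorted values and the cumulative fraction at each one."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


def ecdf_at(values, x):
    """Fraction of values that are less than or equal to x."""
    xs = sorted(values)
    return bisect_right(xs, x) / len(xs)


def plot_ecdfs(distances_by_algorithm):
    """Draw one ECDF step curve per algorithm on shared axes."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    for name, distances in distances_by_algorithm.items():
        xs, ys = ecdf(distances)
        plt.step(xs, ys, where="post", label=name)
    plt.xlabel("distance")  # assumed meaning of the measured quantity
    plt.ylabel("fraction of values at or below distance")
    plt.legend()
    plt.show()


# Made-up placeholder numbers, NOT results from the thread.
sample = {
    "km": [0.9, 1.1, 1.4, 1.2],
    "bskm": [0.7, 0.8, 1.0, 0.9],
}
# plot_ecdfs(sample)  # uncomment to draw both curves on one plot
```

With every algorithm on the same axes, a curve that sits further to the left
has more of its distances below any given value, which is what makes the
comparison easy to read at a glance.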

On Fri, Feb 22, 2013 at 8:33 AM, Dan Filimon <> wrote:

> As most of the regulars know, I'm working with Ted Dunning on a new
> clustering framework for Mahout that should land in 0.8.
> Part of my work is comparing the clustering quality of the new code
> with the existing Mahout implementation.
> I compiled a CSV of the quality data [1]. I ran 5 runs of the
> clustering on the 20 newsgroups data set comparing Mahout KMeans (km),
> Ball KMeans (bkm), Streaming KMeans (skm) and Streaming KMeans
> followed by Ball KMeans (bskm).
> I'm now looking at making some appealing plots for the data. For
> instance, I think I want to make box plots of individual clustering
> runs. Here's an example [2] of what a clustering looks like for one
> run of Mahout's standard k-means.
> There's a box for each cluster, the mean distance is the thick line,
> the limits are the 1st and 3rd quartiles and the whiskers are the min
> and max distances.
> The blue horizontal line is the mean of all average cluster distances.
> The green horizontal line is the median of all average cluster distances.
> I intend to make similar plots for the other runs and then
> aggregate the means of the runs into box plots for the different
> classes of k-means.
> The main result being that streaming k-means + ball k-means (as done
> in the MR) gives a high quality clustering.
> How do you feel about this plot? Is it too dense? Too colorful? Should
> I not draw the median any more?
> What are some other good ways of plotting the quality given the data set?
> Thanks!
> [1]
> [2]
