mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Vector distance within a cluster
Date Tue, 26 Feb 2013 22:00:37 GMT
Chris,

How are you doing your manual judgement step?  Coherence against an
external standard?  Or internal consistency/homogeneity?

Except in unusual situations, it is to be expected that most clusterings
are not particularly stable (i.e. they will not reproduce the same clusters
from run to run).  As such, it is also unlikely that they will reproduce
externally defined clusters any better than they reproduce their own
results.
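
A quick way to see how unstable a clustering is would be to run it several
times with different seeds and compare the assignments.  The following is
only a minimal sketch in plain Python, not Mahout code: `kmeans` and
`rand_index` are hypothetical helper names, the data is synthetic, and the
Rand index is just one of several possible agreement measures.

```python
import math
import random
from itertools import combinations


def kmeans(points, k, seed, iters=20):
    """Plain Lloyd's k-means on tuples; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers drawn from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def rand_index(a, b):
    """Fraction of point pairs that both labelings treat the same way
    (together in both, or apart in both).  Label names do not matter."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


# Synthetic data with no real cluster structure (an assumption for the demo).
rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
runs = [kmeans(data, 3, seed) for seed in range(5)]
scores = [rand_index(runs[0], other) for other in runs[1:]]
print(scores)  # values near 1.0 mean the runs agree with the first run
```

If the scores vary a lot from seed to seed, there is little reason to expect
any single run to line up with an external standard.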

Likewise, there is no guarantee that the results will be easily
interpretable.  One thought along these lines is to add L_1 regularization
to the k-means algorithm.  Another is to look into what the Carrot2 project
has done where, according to the developers, they have put some effort into
making clusters that are easily summarizable.  This might be similar in
effect to the regularization step I just mentioned.
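
To make the L_1 idea a bit more concrete: how the penalty would enter is not
spelled out above, so the following is just one possible reading, in which
the L_1 norm of each centroid is penalized in the update step.  For a cluster
with n members, minimizing sum ||x - c||^2 + lam * ||c||_1 over c gives the
ordinary mean, soft-thresholded by lam / (2n) in each coordinate.  The names
`soft_threshold`, `l1_centroid` and `lam` are hypothetical; this is plain
Python, not anything Mahout provides.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values within it become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def l1_centroid(members, lam):
    """Centroid update with an L1 penalty of weight lam on the centroid.

    Minimizes sum(||x - c||^2 for x in members) + lam * ||c||_1, which works
    out to the coordinate-wise mean soft-thresholded by lam / (2 * n).
    """
    n = len(members)
    means = [sum(col) / n for col in zip(*members)]
    return [soft_threshold(m, lam / (2.0 * n)) for m in means]


# Two toy "documents" with two term weights each (made-up numbers).
centroid = l1_centroid([(1.0, 0.2), (3.0, 0.4)], lam=2.0)
print(centroid)
```

With term vectors, the small weights in a centroid are driven to exactly
zero, so each centroid is described by fewer terms.  That is the sense in
which the result might be easier to summarize.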

On Tue, Feb 26, 2013 at 7:02 AM, Chris Harrington <chris@heystaks.com> wrote:

> Well, what I'm trying to do is create clusters of topically similar
> content via kmeans.
>
> Since I'm basing validity on topics there's a manual judgement step.
> And that manual step is taking a prohibitive amount of time to check many
> clustering runs, hence the desire for some stats to indicate roughly how
> good the clusters are.
>
> So I want some stats that, at a glance, will tell me which
> clusters "should" be good and manually check them instead of having to
> check each and every one.
>
> I was thinking that a file with
>
> 1. the number of clusters,
> 2. the avg distance of all points to every other point,
> 3. the avg distance of the points furthest from the center to all other
> points (furthest 25% of all points within a cluster),
> 4. the avg distance of the points closest to the center to all other points
> (closest 25% of all points within a cluster)
>
> would allow me to quickly see if I should even bother manually checking
> the clustering output, the logic being that if 4, 3 and 2 are similar in
> value then it's probably a decent cluster and I can manually check it. Also,
> a comparison of 3 vs 2 would indicate if the cluster contains a number of
> distant outliers, and 4 vs 2 would show roughly how dense a cluster
> is.
>
> This makes sense, right? Or am I barking up the wrong tree?
>
> On 25 Feb 2013, at 20:15, Ted Dunning wrote:
>
> > The best way to evaluate a cluster really depends on what your purpose is.
> >
> > My own purpose is typically to use the clustering as a description of the
> > probability distribution of data.
> >
> > For that purpose, the best evaluation is distance to centroids for held-out
> > data.  The use of held-out data is critical here since otherwise you could
> > just put a single cluster at every data point and get zero distance for the
> > original data.  For held-out data, of course, the story would be different.
> >
> > This view of things is very good from the standpoint of machine learning
> > and data compression, but might be less useful for certain purposes that
> > have to do with explanation of data in human readable form.  My experience
> > is that it is common for a clustering algorithm to be very good as a
> > probability distribution description but quite bad for human inspection.
> >
> > My own tendency would be to adapt the outline you gave to work on held-out
> > data instead of the original training data.
> >
> > On Mon, Feb 25, 2013 at 4:27 AM, Chris Harrington <chris@heystaks.com> wrote:
> >
> >> Hi all,
> >>
> >> I want to find all the vectors within a cluster and then find the distance
> >> between them and every other vector within a cluster, in hopes this will
> >> give me a good idea of how similar each vector within a cluster is as well
> >> as identify outlier vectors.
> >>
> >> So there are 2 things I want to ask.
> >>
> >> 1. Is this a sensible approach to evaluating the cluster quality?
> >>
> >> 2. Is the correct file to get this info from the
> >> clusteredPoints/parts-m-00000 file?
> >>
> >> Thanks,
> >> Chris
> >>
> >>
> >>
>
>

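For reference, items 2 to 4 of the outline quoted above could be computed per
cluster roughly as follows.  This is a sketch under assumptions rather than
anything from Mahout: points are plain tuples, distance is Euclidean, and
`cluster_stats`, `avg_dist_to_others` and the `fraction` parameter are
hypothetical names.  The all-pairs computation is quadratic in cluster size,
so it would need sampling for large clusters.

```python
import math
from itertools import combinations


def mean(values):
    """Arithmetic mean of an iterable; 0.0 if it is empty."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def avg_dist_to_others(subset_idx, points):
    """Average distance from each point in the subset to every other
    point in the cluster (the point itself is skipped)."""
    return mean(math.dist(points[i], q)
                for i in subset_idx
                for j, q in enumerate(points) if j != i)


def cluster_stats(points, center, fraction=0.25):
    """Summary statistics for one cluster, following items 2 to 4."""
    # Point indices ordered from closest to furthest from the center.
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], center))
    m = max(1, int(len(points) * fraction))
    return {
        "avg_pairwise": mean(math.dist(p, q) for p, q in combinations(points, 2)),
        "avg_furthest": avg_dist_to_others(order[-m:], points),
        "avg_closest": avg_dist_to_others(order[:m], points),
    }


# Toy one-dimensional cluster with a single distant outlier.
stats = cluster_stats([(0.0,), (1.0,), (2.0,), (9.0,)], center=(1.0,))
print(stats)
```

Following the advice earlier in the thread, the points passed in would
ideally be held-out points assigned to their nearest centroid, not the
points the centroids were trained on.
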