mahout-user mailing list archives

From Daniel McEnnis <dmcen...@gmail.com>
Subject Re: Identify "less similar" documents
Date Wed, 20 Apr 2011 16:02:56 GMT
Claudia,

This gets into "goodness metrics": measures of how good a cluster is.
The radius is effectively the max distance metric, that is, the maximum
distance from any vector in the cluster to the cluster mean. It is a
less common but still useful metric. The most commonly used one is the
average distance to the cluster mean.
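
To make the two metrics concrete, here is a minimal sketch in plain Python rather than Mahout. The function names and the choice of Euclidean distance are assumptions made for illustration; any distance measure could be passed in instead.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_mean(vectors):
    """Component-wise mean of the vectors in a cluster."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def tightness(vectors, distance=euclidean):
    """Return the max and average distance from members to the cluster mean."""
    mean = cluster_mean(vectors)
    dists = [distance(v, mean) for v in vectors]
    return {
        "max_distance": max(dists),                  # the "radius" style metric
        "average_distance": sum(dists) / len(dists),  # the more common metric
    }


if __name__ == "__main__":
    # Hypothetical cluster with one member far from the rest.
    cluster = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [6.0, 1.0]]
    print(tightness(cluster))
```

A single distant member sets the max distance on its own, while the average moves much less, which is one reason the average is usually preferred as a tightness measure.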

Daniel.

On Wed, Apr 20, 2011 at 10:35 AM, Claudia Grieco <grieco@crmpa.unisa.it> wrote:
> Thanks again.
>
> Does the radius of the cluster give information on the tightness of the cluster?
>
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Tuesday, 19 April 2011 18:57
> To: user@mahout.apache.org
> Cc: Claudia Grieco
> Subject: Re: Identify "less similar" documents
>
>
>
> Yes.  This makes sense.
>
>
>
> I think you might want to qualify X according to which cluster is closest.  Define a
> function that estimates the percentile distance for members of each cluster.  There will
> be one function per cluster.
>
>
>
> Then define a function for each new point that is the percentile score based on the distance
> to the nearest cluster.   The issue with what you suggest is that some clusters are very
> tight and others very loose.
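
One possible reading of the per-cluster percentile idea, sketched in plain Python rather than Mahout. The helper names are hypothetical, cosine distance is taken here as 1 minus cosine similarity, the percentile is a simple empirical one (the share of cluster members at or below a given distance), and all vectors are assumed to be non-zero.

```python
import bisect
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a cluster's vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def make_percentile_fn(member_distances):
    """Build one cluster's function: distance -> percent of members at or below it."""
    ordered = sorted(member_distances)

    def percentile(d):
        return 100.0 * bisect.bisect_right(ordered, d) / len(ordered)

    return percentile


def build_models(clusters):
    """Map each cluster name to (centroid, percentile function)."""
    models = {}
    for name, vectors in clusters.items():
        c = centroid(vectors)
        models[name] = (c, make_percentile_fn([cosine_distance(v, c) for v in vectors]))
    return models


def score(point, models):
    """Return (nearest cluster name, percentile of the point's distance in that cluster)."""
    name = min(models, key=lambda k: cosine_distance(point, models[k][0]))
    c, percentile = models[name]
    return name, percentile(cosine_distance(point, c))


if __name__ == "__main__":
    # Hypothetical data: one tight cluster and one loose cluster.
    clusters = {
        "tight": [[1.0, 0.0], [1.0, 0.01], [1.0, -0.01]],
        "loose": [[0.0, 1.0], [0.5, 1.0], [-0.5, 1.0]],
    }
    print(score([1.0, 0.5], build_models(clusters)))
```

Because each cluster gets its own function, a single percentile cutoff (say, 95) adapts to tight and loose clusters alike, which a single absolute distance X cannot do.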
>
> On Tue, Apr 19, 2011 at 2:55 AM, Claudia Grieco <grieco@crmpa.unisa.it> wrote:
>
> Thanks for the suggestion, I'm currently trying this hack:
> I take the documents of the training set and put in each cluster all the docs of a certain
> category.
> I compute the centroid for each category cluster.
> I compute the distance of each new document to all centroids (I'm using CosineDistanceMeasure)
> and I identify as "outliers" the ones whose distance is more than X.
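
The steps above could be sketched as follows, in plain Python rather than Mahout. The function names and the sample threshold are hypothetical, cosine distance is taken as 1 minus cosine similarity, and "distance more than X" is read here as "distance to the nearest centroid exceeds X", which is an assumption about the intent.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def category_centroids(training):
    """One centroid per category, built from that category's training documents."""
    return {category: centroid(docs) for category, docs in training.items()}


def classify(doc, centroids, threshold):
    """Return (is_outlier, nearest_category, distance_to_nearest_centroid)."""
    distances = {cat: cosine_distance(doc, c) for cat, c in centroids.items()}
    nearest = min(distances, key=distances.get)
    return distances[nearest] > threshold, nearest, distances[nearest]


if __name__ == "__main__":
    # Hypothetical term-weight vectors for two categories.
    training = {
        "sports": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
        "politics": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
    }
    centroids = category_centroids(training)
    X = 0.5  # hypothetical threshold
    print(classify([1.0, 0.05, 0.0], centroids, X))
    print(classify([0.0, 0.0, 1.0], centroids, X))
```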
>
> Do you think this makes sense?
