mahout-user mailing list archives

From Ted Dunning
Subject Re: Identify "less similar" documents
Date Tue, 19 Apr 2011 16:56:35 GMT
Yes.  This makes sense.

I think you might want to qualify X according to which cluster is closest.
Define a function that estimates the percentile distance for members of
each cluster.  There will be one function per cluster.

Then, for each new point, compute a percentile score from its distance to
the nearest cluster, using that cluster's function.  The issue with what you
suggest is that some clusters are very tight and others very loose, so a
single distance X means different things for different clusters.
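
A minimal sketch of the per-cluster percentile idea, in plain Python rather than Mahout code. Everything named here (cosine_distance, fit_clusters, percentile_score, is_outlier, and the 0.95 cutoff) is a hypothetical illustration and not part of any Mahout API. It assumes documents are already dense, non-zero vectors, and it uses each cluster's sorted member-to-centroid distances as that cluster's empirical percentile function.

```python
import bisect
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_clusters(training):
    """training maps category -> list of vectors.

    Returns category -> (centroid, sorted member-to-centroid distances).
    The sorted distances act as the per-cluster percentile function.
    """
    model = {}
    for label, vectors in training.items():
        c = centroid(vectors)
        dists = sorted(cosine_distance(v, c) for v in vectors)
        model[label] = (c, dists)
    return model


def percentile_score(model, doc):
    """Return (nearest label, fraction of that cluster's members
    whose distance to the centroid is at most the document's)."""
    label, (c, dists) = min(
        model.items(), key=lambda kv: cosine_distance(doc, kv[1][0])
    )
    d = cosine_distance(doc, c)
    return label, bisect.bisect_right(dists, d) / len(dists)


def is_outlier(model, doc, cutoff=0.95):
    """Flag a document whose percentile score reaches the cutoff.

    The cutoff is an assumed tuning parameter, not a value from the thread.
    """
    _, score = percentile_score(model, doc)
    return score >= cutoff
```

With a tight cluster and a loose one, two new documents at the same raw distance from their nearest centroids can get very different percentile scores, which is the case a single global X cannot handle. In practice each cluster needs enough training members for the percentiles to be meaningful.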

On Tue, Apr 19, 2011 at 2:55 AM, Claudia Grieco wrote:

> Thanks for the suggestion, I'm currently trying this hack:
> I take the documents of the training set and put in each cluster all the
> docs of a certain category.
> I compute the centroid for each category cluster
> I compute the distance of each new document to all centroids (I'm using
> CosineDistanceMeasure) and I identify as "outlier" the ones whose
> distance is more than X
> Do you think this makes sense?
