mahout-user mailing list archives

From "Claudia Grieco" <gri...@crmpa.unisa.it>
Subject R: Identify "less similar" documents
Date Wed, 20 Apr 2011 14:35:00 GMT
Thanks again.

Does the radius of the cluster give information on the tightness of the cluster?

From: Ted Dunning [mailto:ted.dunning@gmail.com]
Sent: Tuesday, 19 April 2011 18:57
To: user@mahout.apache.org
Cc: Claudia Grieco
Subject: Re: Identify "less similar" documents

Yes. This makes sense.

I think you might want to qualify X according to which cluster is closest. For each cluster,
define a function that estimates the percentile of a distance relative to the distances of
that cluster's own members. There will be one function per cluster.

Then, for each new point, define its score as the percentile given by the nearest cluster's
function for the distance to that cluster. The issue with what you suggest is that some
clusters are very tight and others very loose, so a single threshold X will not suit all of them.
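
A minimal sketch of the per-cluster percentile idea, in plain Python rather than Mahout. The function names, the use of an empirical rank as the percentile estimate, and the toy vectors are all assumptions made for illustration; they are not part of the Mahout API.

```python
import bisect
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def fit_percentile_functions(centroids, members):
    """For each cluster, store the sorted member-to-centroid distances.

    centroids: dict cluster name -> centroid vector
    members:   dict cluster name -> list of member vectors
    The sorted list is the (hypothetical) per-cluster percentile function.
    """
    return {
        name: sorted(cosine_distance(v, centroids[name]) for v in members[name])
        for name in centroids
    }


def percentile_score(doc, centroids, sorted_distances):
    """Return (nearest cluster, fraction of its members at distance <= doc's)."""
    nearest = min(centroids, key=lambda k: cosine_distance(doc, centroids[k]))
    d = cosine_distance(doc, centroids[nearest])
    dists = sorted_distances[nearest]
    return nearest, bisect.bisect_right(dists, d) / len(dists)


# Toy data: one tight cluster and one loose cluster.
centroids = {"tight": [1.0, 0.0], "loose": [0.0, 1.0]}
members = {
    "tight": [[1.0, 0.01], [1.0, 0.02], [1.0, -0.01]],
    "loose": [[0.5, 1.0], [-0.5, 1.0], [0.2, 1.0], [0.0, 1.0]],
}
functions = fit_percentile_functions(centroids, members)
```

A score close to 1.0 means the new point is at least as far from the centroid as nearly all training members of its nearest cluster, so a cutoff on the score (say 0.95, an arbitrary choice) plays the role of X while adapting to how tight or loose each cluster is.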

On Tue, Apr 19, 2011 at 2:55 AM, Claudia Grieco <grieco@crmpa.unisa.it> wrote:

Thanks for the suggestion. I'm currently trying this hack:
I take the documents of the training set and put all the docs of a given category into one
cluster.
I compute the centroid of each category cluster.
I compute the distance from each new document to every centroid (I'm using CosineDistanceMeasure)
and flag as an "outlier" any document whose distance is more than X.

Do you think this makes sense?
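
The approach described in this message can be sketched roughly as follows, in plain Python rather than Mahout. The helper names are hypothetical, the cosine distance is reimplemented here instead of calling CosineDistanceMeasure, and "distance more than X" is assumed to mean that the document is farther than X from every centroid.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_centroids(training):
    """training: dict category -> list of document vectors."""
    return {category: centroid(vecs) for category, vecs in training.items()}


def is_outlier(doc, centroids, x):
    """True if doc is farther than x from its nearest (hence every) centroid."""
    return min(cosine_distance(doc, c) for c in centroids.values()) > x


# Toy training set with two categories (made-up vectors).
training = {
    "sports": [[1.0, 0.0], [0.9, 0.1]],
    "politics": [[0.0, 1.0], [0.1, 0.9]],
}
category_centroids = build_centroids(training)
```

With these toy vectors and X = 0.2, a document such as [1.0, 0.0] sits close to the "sports" centroid and is kept, while [1.0, 1.0] lies between the two categories and is flagged.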

 

