mahout-user mailing list archives

From Suneel Marthi <suneel_mar...@yahoo.com>
Subject Re: RowSimilarity
Date Sun, 13 May 2012 02:06:55 GMT
The consider() method in the distance measure (Tanimoto in your scenario) is the one that does
the cut-off.
Almost all of the similarity measures have some implementation of consider()
that cuts off the returned results.

Have a look at Sebastian's explanation in https://issues.apache.org/jira/browse/MAHOUT-803.
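
To make the idea concrete, here is a minimal standalone sketch of a consider()-style pre-filter for Tanimoto. It is not Mahout's code: the class name TanimotoCutoffSketch, the three-argument consider() and the set-based representation are all assumptions made for illustration. It relies only on the fact that the Tanimoto coefficient of two sets can never exceed min(|A|,|B|) / max(|A|,|B|), so a pair whose sizes already violate the threshold can be skipped without computing the similarity.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: hypothetical names, not Mahout's actual classes or signatures.
public class TanimotoCutoffSketch {

    // Tanimoto (Jaccard) coefficient of two sets of term ids: |A ∩ B| / |A ∪ B|.
    static double tanimoto(Set<Integer> a, Set<Integer> b) {
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return union == 0 ? 0.0 : (double) intersection.size() / union;
    }

    // Assumed pre-filter: the coefficient is bounded above by
    // min(sizeA, sizeB) / max(sizeA, sizeB), so a pair is only worth
    // computing when that bound reaches the threshold.
    static boolean consider(int sizeA, int sizeB, double threshold) {
        return sizeA >= sizeB * threshold && sizeB >= sizeA * threshold;
    }

    public static void main(String[] args) {
        Set<Integer> docA = new HashSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> docB = new HashSet<>(Arrays.asList(2, 3, 4));
        double threshold = 0.4; // hypothetical cut-off value

        // Only compute and report the pair if it passes the pre-filter
        // and the actual similarity reaches the threshold.
        if (consider(docA.size(), docB.size(), threshold)) {
            double sim = tanimoto(docA, docB);
            if (sim >= threshold) {
                System.out.println("similarity = " + sim);
            }
        }
    }
}
```

With a filter like this, a row can end up with fewer similar rows than the requested maximum, because pairs that fail the check are never emitted.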




________________________________
 From: Pat Ferrel <pat@occamsmachete.com>
To: user@mahout.apache.org 
Sent: Saturday, May 12, 2012 7:29 PM
Subject: RowSimilarity
 
I tried an experiment running RowSimilarity with 16 docs of short quotations on a similar
subject. It looks to me that, using Tanimoto, the largest pair-wise distance allowed for the
similar docs was 0.4. Though I asked for 10 similar docs, I got anywhere from 0 to 10. I see this same effect
with larger data sets but haven't seen an obvious cut-off point.

I was expecting to be able to make the decision about cut-off distance myself. In other words
I was expecting to always get 20 similar docs when I asked for 20. It is useful to see what
docs are at larger distances.

How is RowSimilarity deciding when to cut off the returned docs?