mahout-user mailing list archives

From Pat Ferrel <>
Subject RowSimilarity
Date Sat, 12 May 2012 23:29:44 GMT
I ran an experiment with RowSimilarity on 16 docs of short quotations
on a similar subject. Using Tanimoto, it looks to me as if the largest
pairwise distance allowed for the similar docs was 0.4. Although I asked
for 10 similar docs, I got anywhere from 0 to 10. I see the same effect
with larger data sets but haven't found an obvious cut-off point.

I was expecting to be able to make the decision about the cut-off
distance myself. In other words, I was expecting to always get 20
similar docs when I asked for 20. It is useful to see which docs are at
larger distances.
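
To make that expectation concrete, here is a minimal sketch in plain
Python. It is not Mahout code and says nothing about how RowSimilarity
is implemented; `tanimoto` and `top_n_similar` are hypothetical names,
and docs are assumed to be simple sets of terms. It ranks every other
doc by Tanimoto similarity and returns exactly N of them, however low
the scores are, as long as at least N other docs exist.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two term sets."""
    a, b = set(a), set(b)
    union = len(a | b)
    # Two empty docs share nothing; treat their similarity as 0.
    return len(a & b) / union if union else 0.0


def top_n_similar(docs, query_id, n):
    """Return up to n (doc_id, similarity) pairs, most similar first.

    No similarity threshold is applied, so low-scoring docs are
    returned too (hypothetical behavior, for illustration only).
    """
    scores = [
        (other_id, tanimoto(docs[query_id], terms))
        for other_id, terms in docs.items()
        if other_id != query_id
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]


# Toy data: "c" shares no terms with "a" but is still returned.
docs = {
    "a": {"x", "y", "z"},
    "b": {"x", "y"},
    "c": {"q"},
}
result = top_n_similar(docs, "a", 2)
```

With this behavior the caller can apply any cut-off afterwards, for
example by filtering `result` on the similarity value.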

How is RowSimilarity deciding where to cut off the returned docs?
