mahout-user mailing list archives

From Suneel Marthi <>
Subject Re: RowSimilarity
Date Sun, 13 May 2012 06:25:05 GMT
 Pat's question was that he was seeing fewer documents than the number specified by 'maxSimilaritiesPerRow'.
This could be happening due to the 'consider' functionality of the applied similarity measure.

 From: Sebastian Schelter <>
Sent: Sunday, May 13, 2012 2:08 AM
Subject: Re: RowSimilarity
The option 'maxSimilaritiesPerRow' determines the maximum number of
similar docs/items/rows per row. Whether there are enough similar rows
per row depends on your data, so you can't always get 20 similar docs.

The option 'threshold' determines the minimum similarity value for a
pair of docs; pairs below it are dropped. This option is not
activated by default, however.
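
To illustrate how the two options interact, here is a minimal sketch in
plain Python. It is not Mahout's implementation: the names `tanimoto` and
`similar_rows`, and the representation of a row as a set of term ids, are
hypothetical and chosen only for this example. It also assumes that a pair
of rows sharing no terms is never produced at all, which is one way a row
can end up with fewer neighbours than requested.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two sets of term ids."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similar_rows(rows, max_per_row, threshold=None):
    """Return up to max_per_row most similar rows for every row.

    rows:        dict mapping a row id to a set of term ids (assumed layout)
    max_per_row: upper bound on neighbours kept per row, not a guarantee
    threshold:   optional minimum similarity; None means no cut-off
    """
    result = {}
    for i, row_i in rows.items():
        scored = []
        for j, row_j in rows.items():
            if i == j:
                continue
            sim = tanimoto(row_i, row_j)
            if sim == 0.0:
                # Assumption: rows with nothing in common never form a pair.
                continue
            if threshold is not None and sim < threshold:
                # Pair falls below the minimum similarity and is dropped.
                continue
            scored.append((j, sim))
        # Highest similarity first; ties broken by row id for determinism.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        result[i] = scored[:max_per_row]
    return result
```

In this sketch a row can come back with anywhere from zero to
`max_per_row` neighbours: the limit only truncates the list, while the
data (and the threshold, if one is set) decides how many candidates
exist in the first place.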


On 13.05.2012 01:29, Pat Ferrel wrote:
> I tried an experiment running RowSimilarity with 16 docs of short
> quotations on a similar subject. It looks to me that using tanimoto the
> largest pair-wise distance allowed for the similar docs was 0.4. Though
> I asked for 10 similar docs I got 0 to 10. I see this same effect with
> larger data sets but haven't seen an obvious cut-off point.
> I was expecting to be able to make the decision about cut-off distance
> myself. In other words I was expecting to always get 20 similar docs
> when I asked for 20. It is useful to see what docs are at larger distances.
> How is RowSimilarity deciding when to cut-off the returned docs?