mahout-user mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: RowSimilarity
Date Sun, 13 May 2012 06:38:26 GMT
This could simply be because there are fewer similar docs
than the number specified in 'maxSimilaritiesPerRow'.

consider() is only invoked if a threshold was specified.
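
To illustrate, here is a minimal Python sketch of the behaviour described
above. It is not Mahout's actual code: the function names, the sample data,
and the rule that only rows sharing at least one term become candidate pairs
are assumptions made for the example.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two term sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similar_rows(rows, max_similarities_per_row, threshold=None):
    """Return {row_id: [(other_id, similarity), ...]}, best first,
    with at most max_similarities_per_row entries per row."""
    result = {row_id: [] for row_id in rows}
    for i, j in combinations(sorted(rows), 2):
        # Assumption of this sketch: rows with no shared term
        # never become a candidate pair.
        if not rows[i] & rows[j]:
            continue
        sim = tanimoto(rows[i], rows[j])
        # The threshold filter only applies when a threshold was given.
        if threshold is not None and sim < threshold:
            continue
        result[i].append((j, sim))
        result[j].append((i, sim))
    for sims in result.values():
        # Keep only the best entries; this is an upper bound, not a quota.
        sims.sort(key=lambda pair: pair[1], reverse=True)
        del sims[max_similarities_per_row:]
    return result


docs = {
    "d1": {"apple", "pear"},
    "d2": {"apple", "plum"},
    "d3": {"pear", "plum", "fig"},
    "d4": {"kiwi"},
}
top = similar_rows(docs, max_similarities_per_row=3)
strict = similar_rows(docs, max_similarities_per_row=3, threshold=0.3)
```

In this toy data set "d1" ends up with two similar rows and "d4" with none,
even though three were requested, because the limit is only an upper bound.
With the hypothetical threshold of 0.3, "d3" loses all of its entries as well.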

Best,
Sebastian


On 13.05.2012 08:25, Suneel Marthi wrote:
> Pat's question was that he was seeing fewer documents than specified by 'maxSimilaritiesPerRow'.
> This could be happening due to the 'consider' functionality of the applied similarity measure.
> 
> 
> 
> ________________________________
>  From: Sebastian Schelter <ssc@apache.org>
> To: user@mahout.apache.org 
> Sent: Sunday, May 13, 2012 2:08 AM
> Subject: Re: RowSimilarity
>  
> The option 'maxSimilaritiesPerRow' determines the maximum number of
> similar docs/items/rows per row. Whether there are enough similar rows
> per row depends on your data, so you can't always get 20 similar docs.
> 
> The option 'threshold' determines the minimum similarity value for a
> pair of docs (pairs below it are dropped). This option is not
> activated by default, however.
> 
> Best,
> Sebastian
> 
> On 13.05.2012 01:29, Pat Ferrel wrote:
>> I tried an experiment running RowSimilarity with 16 docs of short
>> quotations on a similar subject. It looks to me that using tanimoto the
>> largest pair-wise distance allowed for the similar docs was 0.4. Though
>> I asked for 10 similar docs I got 0 to 10. I see this same effect with
>> larger data sets but haven't seen an obvious cut-off point.
>>
>> I was expecting to be able to make the decision about cut-off distance
>> myself. In other words I was expecting to always get 20 similar docs
>> when I asked for 20. It is useful to see what docs are at larger distances.
>>
>> How is RowSimilarity deciding when to cut off the returned docs?
>>

