mahout-user mailing list archives

From Pat Ferrel <...@farfetchers.com>
Subject Re: RowSimilarity
Date Sun, 13 May 2012 15:33:21 GMT
To paraphrase:

There is some internal threshold for a pair to be considered 'similar'. 
Is this the one supplied with the 'threshold' option mentioned below, and 
do I need to do a special build to get this option activated? I assume it 
is not active because it has not been tested well?

So currently how is the threshold calculated? How can I determine its 
value? Can I vote that this be activated as an optional parameter in the 
future?

I ask this in part because I want to use RowSimilarity in an experiment 
to do something like a non-partitioning hierarchical clustering where 
I'll need to find close centroids in clusters calculated with different 
levels of specificity.
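
A minimal sketch of the behaviour described in the quoted replies below, to make the two options concrete. This is not Mahout code: the function names `tanimoto` and `row_similarities`, the representation of documents as term sets, and the sample data are all hypothetical. It also assumes that pairs with no terms in common are never emitted, which would explain getting fewer rows than requested even when no threshold is set.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two sets: |A & B| / |A | B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def row_similarities(rows, max_per_row, threshold=None):
    """Hypothetical stand-in for a row-similarity job.

    rows: dict mapping row id -> set of term ids.
    Returns a dict mapping row id -> list of (other id, similarity),
    sorted by descending similarity and capped at max_per_row.
    """
    result = {row_id: [] for row_id in rows}
    for i, j in combinations(rows, 2):
        sim = tanimoto(rows[i], rows[j])
        # Assumption: rows with nothing in common never form a pair.
        if sim == 0.0:
            continue
        # The threshold is a minimum similarity and applies only if given.
        if threshold is not None and sim < threshold:
            continue
        result[i].append((j, sim))
        result[j].append((i, sim))
    for row_id, sims in result.items():
        sims.sort(key=lambda pair: pair[1], reverse=True)
        # max_per_row is an upper bound, not a guaranteed count.
        result[row_id] = sims[:max_per_row]
    return result


docs = {
    "d1": {"a", "b", "c"},
    "d2": {"a", "b", "d"},
    "d3": {"c", "e"},
    "d4": {"x", "y"},
}

top = row_similarities(docs, max_per_row=10)
top_thresholded = row_similarities(docs, max_per_row=10, threshold=0.4)
```

In this toy data no row returns 10 neighbours even without a threshold, because only two pairs share any terms. With `threshold=0.4` the d1/d3 pair (similarity 0.25) is dropped as well.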

On 5/12/12 11:38 PM, Sebastian Schelter wrote:
> This could simply be due to the fact that there are fewer similar docs
> than the number specified in 'maxSimilaritiesPerRow'.
>
> consider() is only invoked if a threshold was specified.
>
> Best,
> Sebastian
>
>
> On 13.05.2012 08:25, Suneel Marthi wrote:
>> Pat's question was that he was seeing fewer documents than specified by
>> 'maxSimilaritiesPerRow'. This could be happening due to the 'consider'
>> functionality of the applied similarity measure.
>>
>>
>>
>> ________________________________
>> From: Sebastian Schelter <ssc@apache.org>
>> To: user@mahout.apache.org
>> Sent: Sunday, May 13, 2012 2:08 AM
>> Subject: Re: RowSimilarity
>>
>> The option 'maxSimilaritiesPerRow' determines the maximum number of
>> similar docs/items/rows per row. It depends on your data if there are
>> enough similar rows per row, so you can't always get 20 similar docs.
>>
>> The option 'threshold' determines the minimum similarity value for a
>> pair of docs (otherwise it will be dropped). This option is not
>> activated by default however.
>>
>> Best,
>> Sebastian
>>
>> On 13.05.2012 01:29, Pat Ferrel wrote:
>>> I tried an experiment running RowSimilarity with 16 docs of short
>>> quotations on a similar subject. It looks to me that using Tanimoto the
>>> largest pair-wise distance allowed for the similar docs was 0.4. Though
>>> I asked for 10 similar docs I got 0 to 10. I see this same effect with
>>> larger data sets but haven't seen an obvious cut-off point.
>>>
>>> I was expecting to be able to make the decision about cut-off distance
>>> myself. In other words I was expecting to always get 20 similar docs
>>> when I asked for 20. It is useful to see what docs are at larger distances.
>>>
>>> How is RowSimilarity deciding when to cut off the returned docs?
>>>
>
>
