mahout-user mailing list archives

From Sebastian Schelter
Subject Re: RowSimilarity
Date Sun, 13 May 2012 16:10:09 GMT
Hi Pat,

RowSimilarityJob allows the use of a lot of different similarity
measures (cosine, Jaccard coefficient, number of cooccurrences, etc.),
all of which compute a single number for a pair of vectors that denotes
how similar they are. All these measures have the characteristic that
two vectors that do not share a non-zero value in at least one
dimension are considered not similar (have similarity 0).
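
To make this property concrete, here is a minimal sketch in plain Python
(not Mahout's code; the dict-based sparse vectors and the function names
are assumptions made for illustration). Each measure returns 0 as soon as
the two vectors have no dimension in common:

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {dimension: value}."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    if dot == 0:
        return 0.0
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def jaccard(a, b):
    """Jaccard coefficient over the sets of non-zero dimensions."""
    union = len(a.keys() | b.keys())
    return len(a.keys() & b.keys()) / union if union else 0.0

def cooccurrences(a, b):
    """Number of dimensions in which both vectors have an entry."""
    return len(a.keys() & b.keys())

# Hypothetical toy documents: doc1 and doc2 share one term, doc3 shares none.
doc1 = {"mahout": 1.0, "similarity": 2.0}
doc2 = {"similarity": 1.0, "job": 1.0}
doc3 = {"hadoop": 3.0}

print(cosine(doc1, doc2), jaccard(doc1, doc2), cooccurrences(doc1, doc2))
print(cosine(doc1, doc3), jaccard(doc1, doc3), cooccurrences(doc1, doc3))
```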

In general, an all-pairs comparison of the kind RowSimilarityJob computes
is quadratic in the number of rows and therefore does not scale.

If we have sparse data such as text or ratings, however, we can exploit
the fact that we only need to compare pairs that share at least one
non-zero value in a dimension. This is the basic idea RowSimilarityJob
uses to avoid a full all-pairs comparison.
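
The idea can be sketched with a small inverted index. This is only an
illustration in plain Python, not the MapReduce implementation in Mahout;
the function name candidate_pairs and the toy data are made up for the
example. Rows are grouped by the dimensions in which they are non-zero,
and only rows that meet in some dimension become candidate pairs:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(rows):
    """Return the pairs of row ids that share at least one non-zero dimension.

    rows: {row_id: {dimension: value}} (sparse vectors).
    """
    # Inverted index: dimension -> ids of the rows that are non-zero there.
    index = defaultdict(list)
    for row_id, vector in rows.items():
        for dim, value in vector.items():
            if value != 0:
                index[dim].append(row_id)

    # Only rows that appear together under some dimension are ever paired.
    pairs = set()
    for row_ids in index.values():
        for a, b in combinations(sorted(row_ids), 2):
            pairs.add((a, b))
    return pairs

rows = {
    "d1": {"mahout": 1.0, "similarity": 2.0},
    "d2": {"similarity": 1.0, "job": 1.0},
    "d3": {"hadoop": 3.0},
    "d4": {"job": 2.0, "hadoop": 1.0},
}

print(sorted(candidate_pairs(rows)))
```

With these four rows there are six possible pairs, but only three of them
share a dimension, so only those three would need a similarity computation.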

In some real-world use cases you will furthermore encounter a lot of
pairs with near-zero similarities that are of little value to you. To
avoid computing these, RowSimilarityJob provides the option to specify
a minimum threshold so that it ignores pairs with a similarity value
below this threshold. This threshold is data-dependent and you have to
find it experimentally.
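
As a rough sketch of how a threshold interacts with a per-row maximum
such as 'maxSimilaritiesPerRow' (plain Python, not Mahout's code; the
function top_similar, its parameters and the similarity values are
assumptions for illustration):

```python
import heapq

def top_similar(similarities, threshold=None, max_per_row=10):
    """Keep at most max_per_row neighbours per row, dropping weak pairs.

    similarities: {(row_a, row_b): value} for pairs that were compared.
    threshold: minimum similarity to keep; None means no filtering.
    """
    per_row = {}
    for (a, b), sim in similarities.items():
        if threshold is not None and sim < threshold:
            continue  # pair is below the minimum similarity
        per_row.setdefault(a, []).append((b, sim))
        per_row.setdefault(b, []).append((a, sim))
    return {
        row: heapq.nlargest(max_per_row, others, key=lambda pair: pair[1])
        for row, others in per_row.items()
    }

# Hypothetical similarity values for three compared pairs.
sims = {("d1", "d2"): 0.63, ("d2", "d4"): 0.05, ("d3", "d4"): 0.71}

print(top_similar(sims))                 # no threshold
print(top_similar(sims, threshold=0.1))  # the weak d2/d4 pair is dropped
```

Even without a threshold, a row can end up with fewer neighbours than the
maximum simply because few other rows share a non-zero dimension with it.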


On 13.05.2012 17:33, Pat Ferrel wrote:
> To paraphrase:
> There is some internal threshold to be considered 'similar'. This is the
> one supplied with the 'threshold' option mentioned below and I need to
> do a special build to get this option activated? I assume it is not
> active because it has not been tested well?
> So currently how is the threshold calculated? How can I determine its
> value? Can I vote that this be activated as an optional parameter in the
> future?
> I ask this in part because I want to use RowSimilarity in an experiment
> to do something like a non-partitioning hierarchical clustering where
> I'll need to find close centroids in clusters calculated with different
> levels of specificity.
> On 5/12/12 11:38 PM, Sebastian Schelter wrote:
>> This could be simply due to the fact that there are less similar docs
>> than the number specified in 'maxSimilaritiesPerRow'.
>> consider() is only invoked if a threshold was specified.
>> Best,
>> Sebastian
>> On 13.05.2012 08:25, Suneel Marthi wrote:
>>>   Pat's question was that he was seeing less documents than that
>>> specified by 'maxSimilaritiesPerRow', this could be happening due to
>>> the 'consider' functionality of the applied similarity measure.
>>> ________________________________
>>>   From: Sebastian Schelter<>
>>> To:
>>> Sent: Sunday, May 13, 2012 2:08 AM
>>> Subject: Re: RowSimilarity
>>> The option 'maxSimilaritiesPerRow' determines the maximum number of
>>> similar docs/items/rows per row. It depends on your data if there are
>>> enough similar rows per row, so you can't always get 20 similar docs.
>>> The option 'threshold' determines the minimum similarity value for a
>>> pair of docs (otherwise it will be dropped). This option is not
>>> activated by default however.
>>> Best,
>>> Sebastian
>>> On 13.05.2012 01:29, Pat Ferrel wrote:
>>>> I tried an experiment running RowSimilarity with 16 docs of short
>>>> quotations on a similar subject. It looks to me that using tanimoto the
>>>> largest pair-wise distance allowed for the similar docs was 0.4. Though
>>>> I asked for 10 similar docs I got 0 to 10. I see this same effect with
>>>> larger data sets but haven't seen an obvious cut-off point
>>>> I was expecting to be able to make the decision about cut-off distance
>>>> myself. In other words I was expecting to always get 20 similar docs
>>>> when I asked for 20. It is useful to see what docs are at larger
>>>> distances.
>>>> How is RowSimilarity deciding when to cut-off the returned docs?
