mahout-user mailing list archives

From: Sebastian Schelter <...@apache.org>
Subject: Re: Need to reduce execution time of RowSimilarityJob
Date: Mon, 01 Oct 2012 20:25:09 GMT
The cosine similarity as computed by RowSimilarityJob is the cosine
similarity between the whole vectors.

See
org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
for details.

First, both vectors are scaled to unit length in normalize(); after that,
their dot product in similarity() (which only needs the elements that occur
in both vectors) gives the cosine between them.

On 01.10.2012 21:52, bangbig wrote:
> I think it helps to understand how RowSimilarityJob gets its result.
> For two items,
>   itemA: 0, 0,  a1, a2, a3, 0
>   itemB: 0, b1, b2, b3, 0,  0
> the computation only uses the overlapping non-zero parts of the vectors
> (a1 and a2 in itemA, aligned with b2 and b3 in itemB).
> The cosine similarity is thus
> (a1*b2 + a2*b3) / (sqrt(a1*a1 + a2*a2) * sqrt(b2*b2 + b3*b3)).
> 1) If itemA and itemB have just one common word, the result is 1.
> 2) If the values of the vectors are almost the same, the result is also
> nearly 1.
> For the two cases above, you might consider using association rules to
> address the problem.
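The overlap-only computation described above can be sketched like this (a hypothetical helper in plain Java, not Mahout code; as noted at the top of this message, RowSimilarityJob itself takes the norms over the whole vectors). It makes case 1 easy to see: with a single common element the restricted cosine is always exactly 1.

```java
// Cosine restricted to the common elements: both the dot product and the
// norms only use positions that are non-zero in both vectors.
public class OverlapCosine {

    static double overlapCosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != 0.0 && b[i] != 0.0) {
                dot += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Case 1: only one common word -> the result is always 1.0,
        // since (a*b) / (|a| * |b|) = 1 for any single pair of values.
        double[] a = {0, 0, 7, 0};
        double[] b = {0, 0, 3, 5};
        System.out.println(overlapCosine(a, b)); // prints 1.0
    }
}
```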
> 
> At 2012-10-01 20:53:16, yamo93 <yamo93@gmail.com> wrote:
>> It seems that RowSimilarityJob does not have the same weakness, even
>> though I also use CosineSimilarity. Why?
>>
>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>> Yes, this is one of the weaknesses of this particular flavor of this
>>> particular similarity metric. The more sparse, the worse the problem
>>> is in general. There are some band-aid solutions like applying some
>>> kind of weight against similarities based on small intersection size.
>>> Or you can pretend that missing values are 0 (PreferenceInferrer),
>>> which can introduce its own problems, or perhaps some mean value.
>>>
>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <yamo93@gmail.com> wrote:
>>>> Thanks for replying.
>>>>
>>>> So, documents with only one word in common are more likely to come out
>>>> as similar than documents with more words in common, right?
>>>>
>>>>
>>>>
>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>> Similar items, right? You should look at the vectors that have 1.0
>>>>> similarity and see if they are in fact collinear. This is still by far
>>>>> the most likely explanation. Remember that the vector similarity is
>>>>> computed over elements that exist in both vectors only. They just have
>>>>> to have 2 identical values for this to happen.
>>>>>
>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <yamo93@gmail.com> wrote:
>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>> It sounds like a bug somewhere.
>>>>>>
>>>>>>
>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>>> occurs when two vectors are just a scalar multiple of each other
>>>>>>> (0 angle between them). It's possible there are several of these,
>>>>>>> and so their 1.0 similarities dominate the result.
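A quick check of the scalar-multiple claim quoted above, in plain Java (the class name is illustrative): vectors at zero angle have cosine similarity 1.

```java
// Two vectors that are scalar multiples of each other point in the same
// direction, so their full-vector cosine similarity is 1 (up to floating-
// point rounding).
public class ScalarMultiple {

    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] v = {1, 2, 3};
        double[] w = {2, 4, 6}; // w = 2 * v, so the angle between them is 0
        System.out.println(cosine(v, w));
    }
}
```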
>>>>>>>
>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <yamo93@gmail.com> wrote:
>>>>>>>> I saw something strange: all recommended items, returned by
>>>>>>>> mostSimilarItems(), have a value of 1.0.
>>>>>>>> Is it normal?
>>>>
>>
> 

