Hello Seb,
As I understand it, the algorithm is the same as UncenteredCosine except
for the normalization step (UncenteredCosine has the drawback that vectors
with only one word in common get a similarity of 1.0)... but the results
are quite different. Is this just an effect of the consider() method,
which removes irrelevant values?
I looked at the code, but there is almost nothing in
org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
the logic seems to be in SimilarityReducer, which is not so simple to
understand...
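To make the comparison concrete, here is a minimal sketch in plain Python of the two computations as I understand them from this thread. It is not Mahout code: the function names cosine_full and cosine_common_only and the sample documents are hypothetical, and the sparse vectors are just dicts. The first function uses the norms of the whole vectors (as described for RowSimilarityJob), the second restricts everything to the elements present in both vectors (the behaviour described for UncenteredCosine).

```python
import math


def cosine_full(a, b):
    """Cosine between whole sparse vectors: norms use every element."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_common_only(a, b):
    """Cosine computed only over keys that exist in both vectors."""
    common = a.keys() & b.keys()
    dot = sum(a[k] * b[k] for k in common)
    norm_a = math.sqrt(sum(a[k] ** 2 for k in common))
    norm_b = math.sqrt(sum(b[k] ** 2 for k in common))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents sharing a single word ("mahout").
doc_a = {"mahout": 2.0, "hadoop": 1.0, "vector": 3.0}
doc_b = {"mahout": 5.0, "lucene": 4.0}

print(cosine_common_only(doc_a, doc_b))  # 1.0: one shared element is always collinear
print(cosine_full(doc_a, doc_b))         # well below 1.0 (roughly 0.42)
```

With a single shared word the restricted version always returns 1.0, whatever the weights are, while the whole-vector version is pulled down by the non-shared elements. If that reading is right, it would explain the different results.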
Thanks for helping,
On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
> The cosine similarity as computed by RowSimilarityJob is the cosine
> similarity between the whole vectors.
>
> see
> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
> for details
>
> At first both vectors are scaled to unit length in normalize() and after
> this their dot product in similarity() (which can be computed from
> elements that exist in both vectors) gives the cosine between those.
>
> On 01.10.2012 21:52, bangbig wrote:
>> I think it's better to understand how the RowSimilarityJob gets the result.
>> For two items,
>> itemA, 0, 0, a1, a2, a3, 0
>> itemB, 0, b1, b2, b3, 0, 0
>> when computing, it just uses the blue parts of the vectors (the positions
>> filled in both: a1, a2 and b2, b3).
>> the cosine similarity thus is
>> (a1*b2 + a2*b3) / (sqrt(a1*a1 + a2*a2) * sqrt(b2*b2 + b3*b3))
>> 1) if itemA and itemB have just one common word, the result is 1;
>> 2) if the values of the vectors are almost the same, the value would also
>> be nearly 1;
>> and for the two cases above, I think you could consider using association
>> rules to address the problem.
>>
>> At 20121001 20:53:16,yamo93 <yamo93@gmail.com> wrote:
>>> It seems that RowSimilarityJob does not have the same weakness, even
>>> though I also use CosineSimilarity with it. Why?
>>>
>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>> Yes, this is one of the weaknesses of this particular flavor of this
>>>> particular similarity metric. The more sparse, the worse the problem
>>>> is in general. There are some bandaid solutions like applying some
>>>> kind of weight against similarities based on small intersection size.
>>>> Or you can pretend that missing values are 0 (PreferenceInferrer),
>>>> which can introduce its own problems, or perhaps some mean value.
>>>>
>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <yamo93@gmail.com> wrote:
>>>>> Thanks for replying.
>>>>>
>>>>> So, documents with only one word in common are more likely to be scored
>>>>> as similar than documents with more words in common, right?
>>>>>
>>>>>
>>>>>
>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>> Similar items, right? You should look at the vectors that have 1.0
>>>>>> similarity and see if they are in fact collinear. This is still by far
>>>>>> the most likely explanation. Remember that the vector similarity is
>>>>>> computed over elements that exist in both vectors only. They just have
>>>>>> to have 2 identical values for this to happen.
>>>>>>
>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <yamo93@gmail.com> wrote:
>>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>>> It sounds like a bug somewhere.
>>>>>>>
>>>>>>>
>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>>>> occurs when two vectors are just a scalar multiple of each other (0
>>>>>>>> angle between them). It's possible there are several of these, and so
>>>>>>>> their 1.0 similarities dominate the result.
>>>>>>>>
>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <yamo93@gmail.com> wrote:
>>>>>>>>> I saw something strange: all recommended items returned by
>>>>>>>>> mostSimilarItems() have a value of 1.0.
>>>>>>>>> Is that normal?
