It seems that RowSimilarityJob does not have the same weakness, even
though I also use CosineSimilarity there. Why?
On 10/01/2012 12:37 PM, Sean Owen wrote:
> Yes, this is one of the weaknesses of this particular flavor of this
> particular similarity metric. The more sparse, the worse the problem
> is in general. There are some bandaid solutions like applying some
> kind of weight against similarities based on small intersection size.
> Or you can pretend that missing values are 0 (PreferenceInferrer),
> which can introduce its own problems, or perhaps some mean value.
>
> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <yamo93@gmail.com> wrote:
>> Thanks for replying.
>>
>> So, documents with only one word in common are more likely to be rated
>> similar than documents with more words in common, right?
>>
>>
>>
>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>> Similar items, right? You should look at the vectors that have 1.0
>>> similarity and see if they are in fact collinear. This is still by far
>>> the most likely explanation. Remember that the vector similarity is
>>> computed over elements that exist in both vectors only. They just have
>>> to have 2 identical values for this to happen.
>>>
>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <yamo93@gmail.com> wrote:
>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>> It sounds like a bug somewhere.
>>>>
>>>>
>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>> occurs when two vectors are just a scalar multiple of each other (0
>>>>> angle between them). It's possible there are several of these, and so
>>>>> their 1.0 similarities dominate the result.
>>>>>
>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <yamo93@gmail.com> wrote:
>>>>>> I saw something strange: all recommended items, returned by
>>>>>> mostSimilarItems(), have a value of 1.0.
>>>>>> Is this normal?
>>
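
To make the effect discussed above concrete, here is a minimal standalone sketch in plain Python. It does not call Mahout; the function names, the example documents and the min_overlap threshold are all made up for illustration. It compares a cosine computed only over elements present in both vectors with one that treats missing values as 0, plus one possible way to down-weight small intersections.

```python
import math

def cosine_over_common(a, b):
    """Cosine restricted to keys present in BOTH sparse vectors."""
    common = a.keys() & b.keys()
    dot = sum(a[k] * b[k] for k in common)
    norm_a = math.sqrt(sum(a[k] ** 2 for k in common))
    norm_b = math.sqrt(sum(b[k] ** 2 for k in common))
    if norm_a == 0.0 or norm_b == 0.0:
        return float("nan")  # no overlap, or all-zero overlap
    return dot / (norm_a * norm_b)

def cosine_missing_as_zero(a, b):
    """Cosine over the full vectors, with missing values counted as 0."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v ** 2 for v in a.values()))
    norm_b = math.sqrt(sum(v ** 2 for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return float("nan")
    return dot / (norm_a * norm_b)

def damped_cosine(a, b, min_overlap=5):
    """Hypothetical bandaid: scale down similarities built on few shared keys.

    min_overlap is an arbitrary illustrative threshold, not a Mahout setting.
    """
    overlap = len(a.keys() & b.keys())
    return cosine_over_common(a, b) * min(overlap, min_overlap) / min_overlap

# Two made-up documents that share a single term.
doc1 = {"mahout": 3.0, "hadoop": 1.0, "java": 2.0}
doc2 = {"mahout": 5.0, "python": 4.0}

print(cosine_over_common(doc1, doc2))      # one shared term: always collinear
print(cosine_missing_as_zero(doc1, doc2))  # non-shared terms lower the score
print(damped_cosine(doc1, doc2))           # small overlap is penalised
```

With a single shared term, the restricted vectors are one-dimensional, so the first score is 1.0 no matter what the weights are. The other two variants come out well below 1.0 for the same pair.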
