mahout-user mailing list archives

From yamo93 <yam...@gmail.com>
Subject Re: Need to reduce execution time of RowSimilarityJob
Date Tue, 02 Oct 2012 12:36:58 GMT
+1 for the implementation over all entries.
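
The distinction discussed below, cosine over matching entries only versus cosine over all entries, can be sketched as follows (plain Python, illustrative helper names, not Mahout code):

```python
import math

def cosine_all_entries(a, b):
    """Cosine over the whole vectors: zero entries count, as term vectors require."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cosine_matching_entries(a, b):
    """Cosine restricted to entries that are nonzero in both vectors
    (the matching-entries behaviour discussed in this thread)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != 0 and y != 0]
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    return dot / (norm_a * norm_b)

# Two documents sharing exactly one term (index 2):
doc_a = [0, 0, 3, 0]
doc_b = [0, 2, 5, 0]

print(cosine_matching_entries(doc_a, doc_b))  # 1.0: a single shared term
print(cosine_all_entries(doc_a, doc_b))       # ~0.93: zeros penalize the mismatch
```

The matching-entries version collapses to 1.0 whenever the overlap is a single term, which is exactly the artifact reported in this thread.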

On 10/02/2012 11:50 AM, Sebastian Schelter wrote:
> I don't see why documents with only one word in common should have a
> similarity of 1.0 in RowSimilarityJob. consider() is only invoked if you
> specify a threshold for the similarity.
>
> UncenteredCosineSimilarity works on matching entries only, which is
> problematic for documents, as empty entries have a meaning (0 term
> occurrences) as opposed to collaborative filtering data.
>
> Maybe we should remove UncenteredCosine and create another similarity
> implementation that computes the cosine correctly over all entries.
>
> --sebastian
>
>
> On 02.10.2012 10:08, yamo93 wrote:
>> Hello Seb,
>>
>> In my understanding, the algorithm is the same (except for the
>> normalization part) as UncenteredCosine (with the drawback that vectors
>> with only one word in common have a similarity of 1.0) ... but the
>> results are quite different (is this just an effect of the consider()
>> method, which removes irrelevant values?) ...
>>
>> I looked at the code but there is almost nothing in
>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
>> the code seems to be in SimilarityReducer, which is not so simple to
>> understand ...
>>
>> Thanks for helping,
>>
>> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>>> The cosine similarity as computed by RowSimilarityJob is the cosine
>>> similarity between the whole vectors.
>>>
>>> see
>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>>>
>>> for details
>>>
>>> At first both vectors are scaled to unit length in normalize(), and after
>>> this their dot product in similarity() (which can be computed from
>>> elements that exist in both vectors) gives the cosine between them.
>>>
>>> On 01.10.2012 21:52, bangbig wrote:
>>>> I think it's better to understand how the RowSimilarityJob gets the
>>>> result.
>>>> For two items,
>>>> itemA: 0, 0,  a1, a2, a3, 0
>>>> itemB: 0, b1, b2, b3, 0,  0
>>>> when computing, it uses only the overlapping parts of the vectors (the
>>>> entries that are nonzero in both, here a1/b2 and a2/b3).
>>>> the cosine similarity thus is
>>>> (a1*b2 + a2*b3) / (sqrt(a1*a1 + a2*a2) * sqrt(b2*b2 + b3*b3))
>>>> 1) if itemA and itemB have just one common word, the result is 1;
>>>> 2) if the values of the vectors are almost the same, the value would
>>>> also be nearly 1;
>>>> and for those two cases, you might consider using association rules to
>>>> address the problem.
>>>>
>>>> At 2012-10-01 20:53:16, yamo93 <yamo93@gmail.com> wrote:
>>>>> It seems that RowSimilarityJob does not have the same weakness, but I
>>>>> also use CosineSimilarity. Why?
>>>>>
>>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>>> Yes, this is one of the weaknesses of this particular flavor of this
>>>>>> particular similarity metric. The more sparse, the worse the problem
>>>>>> is in general. There are some band-aid solutions like applying some
>>>>>> kind of weight against similarities based on small intersection size.
>>>>>> Or you can pretend that missing values are 0 (PreferenceInferrer),
>>>>>> which can introduce its own problems, or perhaps some mean value.
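
The "weight against small intersections" idea can be sketched as a significance-damping factor, a common heuristic; the function name and the default k below are illustrative assumptions, not a Mahout API:

```python
def damped_similarity(sim, overlap, k=50):
    """One possible band-aid: shrink a similarity that was computed on
    few co-occurring entries toward 0; k (an assumed constant) controls
    how many shared entries are needed before the similarity is trusted."""
    return sim * (overlap / (overlap + k))

# A perfect 1.0 similarity built on a single common entry is damped hard,
# while the same similarity built on 200 common entries barely moves.
print(damped_similarity(1.0, 1))    # ~0.02
print(damped_similarity(1.0, 200))  # 0.8
```
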
>>>>>>
>>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <yamo93@gmail.com> wrote:
>>>>>>> Thanks for replying.
>>>>>>>
>>>>>>> So, documents with only one word in common are more likely to be
>>>>>>> similar than documents with more words in common, right?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>>> Similar items, right? You should look at the vectors that have 1.0
>>>>>>>> similarity and see if they are in fact collinear. This is still by
>>>>>>>> far the most likely explanation. Remember that the vector similarity
>>>>>>>> is computed over elements that exist in both vectors only. They just
>>>>>>>> have to have 2 identical values for this to happen.
>>>>>>>>
>>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <yamo93@gmail.com> wrote:
>>>>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>>>>> It sounds like a bug somewhere.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>>>>>> occurs when two vectors are just a scalar multiple of each other
>>>>>>>>>> (0 angle between them). It's possible there are several of these,
>>>>>>>>>> and so their 1.0 similarities dominate the result.
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <yamo93@gmail.com> wrote:
>>>>>>>>>>> I saw something strange: all recommended items returned by
>>>>>>>>>>> mostSimilarItems() have a value of 1.0.
>>>>>>>>>>> Is it normal?
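
Sebastian's later point in this thread, that RowSimilarityJob's normalize()-then-dot-product pipeline yields the whole-vector cosine even though only co-occurring entries are summed, can be checked with a small sketch (plain Python, illustrative names, not the actual Mahout classes):

```python
import math

def normalize(v):
    """Scale a vector to unit length, analogous to the normalize() step."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot_over_cooccurring(a, b):
    """Dot product over entries nonzero in both vectors; zero entries
    contribute nothing to a dot product, so this equals the full dot."""
    return sum(x * y for x, y in zip(a, b) if x != 0 and y != 0)

item_a = [0.0, 0.0, 1.0, 2.0, 3.0, 0.0]
item_b = [0.0, 4.0, 5.0, 6.0, 0.0, 0.0]

# Whole-vector cosine, computed directly:
full = sum(x * y for x, y in zip(item_a, item_b)) / (
    math.sqrt(sum(x * x for x in item_a)) *
    math.sqrt(sum(x * x for x in item_b)))

# The pipeline: normalize first, then dot only co-occurring elements.
pipeline = dot_over_cooccurring(normalize(item_a), normalize(item_b))

print(abs(full - pipeline) < 1e-12)  # True: both are the whole-vector cosine
```

This is why RowSimilarityJob's cosine does not collapse to 1.0 on a single shared term: the normalization uses the full vector lengths, not just the overlap.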

