mahout-user mailing list archives

From bangbig <lizhongliangg...@163.com>
Subject Re:Re: Need to reduce execution time of RowSimilarityJob
Date Wed, 03 Oct 2012 05:19:36 GMT
I think I will try to implement a version tonight!
I have written a package that works directly on Hadoop, dealing with data extracted from Hive.

At 2012-10-03 03:01:32,yamo93 <yamo93@gmail.com> wrote:
>Attached is a class that implements cosine distance as in Hadoop.
>I've only implemented the core method: itemSimilarity.
>
>On 10/02/2012 02:59 PM, yamo93 wrote:
>> OK, I'll try this evening.
>>
>> On 10/02/2012 02:39 PM, Sebastian Schelter wrote:
>>> Would you like to create a patch for this?
>>>
>>> On 02.10.2012 14:36, yamo93 wrote:
>>>> +1 for the implementation over all entries.
>>>>
>>>> On 10/02/2012 11:50 AM, Sebastian Schelter wrote:
>>>>> I don't see why documents with only one word in common should have a
>>>>> similarity of 1.0 in RowSimilarityJob. consider() is only invoked if you
>>>>> specify a threshold for the similarity.
>>>>>
>>>>> UncenteredCosineSimilarity works on matching entries only, which is
>>>>> problematic for documents, as empty entries have a meaning (0 term
>>>>> occurrences) as opposed to collaborative filtering data.
>>>>>
>>>>> Maybe we should remove UncenteredCosine and create another similarity
>>>>> implementation that computes the cosine correctly over all entries.
>>>>>
>>>>> --sebastian
>>>>>
>>>>>
>>>>> On 02.10.2012 10:08, yamo93 wrote:
>>>>>> Hello Seb,
>>>>>>
>>>>>> As I understand it, the algorithm is the same (except for the
>>>>>> normalization part) as UncenteredCosine (with the drawback that vectors
>>>>>> with only one word in common have a similarity of 1.0)... but the
>>>>>> results are quite different (is this just an effect of the consider()
>>>>>> method, which removes irrelevant values?)...
>>>>>>
>>>>>> I looked at the code, but there is almost nothing in
>>>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
>>>>>> the code seems to be in SimilarityReducer, which is not so simple to
>>>>>> understand...
>>>>>>
>>>>>> Thanks for helping,
>>>>>>
>>>>>> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>>>>>>> The cosine similarity as computed by RowSimilarityJob is the cosine
>>>>>>> similarity between the whole vectors.
>>>>>>>
>>>>>>> See
>>>>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>>>>>>> for details.
>>>>>>>
>>>>>>> First, both vectors are scaled to unit length in normalize(); after
>>>>>>> this, their dot product in similarity() (which can be computed from
>>>>>>> elements that exist in both vectors) gives the cosine between them.
>>>>>>>
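The two steps described above (scale to unit length, then take the dot product) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Mahout's code: the sparse vectors are plain dicts of index -> value, and the names normalize, dot, and cosine_all_entries are made up for this note.

```python
import math

def normalize(vec):
    """Scale a sparse vector (dict of index -> value) to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0.0:
        return dict(vec)
    return {i: v / norm for i, v in vec.items()}

def dot(a, b):
    """Dot product of two sparse vectors.

    Only indices present in both vectors contribute, because a missing
    entry counts as 0 and adds nothing to the sum.
    """
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)

def cosine_all_entries(a, b):
    """Cosine between the whole vectors: normalize first, then dot."""
    return dot(normalize(a), normalize(b))

# Hypothetical term vectors: index -> term weight.
doc_a = {2: 1.0, 3: 2.0, 4: 2.0}
doc_b = {1: 2.0, 2: 1.0, 3: 2.0}

# Each vector has length 3, and the shared entries contribute 1*1 + 2*2 = 5,
# so the cosine is 5/9.
print(cosine_all_entries(doc_a, doc_b))
```

The norms are taken over every entry of each vector, so entries that appear in only one document still lower the similarity even though they drop out of the dot product.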
>>>>>>> On 01.10.2012 21:52, bangbig wrote:
>>>>>>>> I think it's better to understand how RowSimilarityJob gets the
>>>>>>>> result.
>>>>>>>> For two items,
>>>>>>>> itemA, 0, 0, a1, a2, a3, 0
>>>>>>>> itemB, 0, b1, b2, b3, 0 , 0
>>>>>>>> when computing, it just uses the blue parts of the vectors (the
>>>>>>>> entries present in both: a1, a2 and b2, b3).
>>>>>>>> The cosine similarity thus is
>>>>>>>> (a1*b2 + a2*b3)/(sqrt(a1*a1 + a2*a2) * sqrt(b2*b2 + b3*b3))
>>>>>>>> 1) if itemA and itemB have just one common word, the result is 1;
>>>>>>>> 2) if the values of the vectors are almost the same, the value would
>>>>>>>> also be nearly 1.
>>>>>>>> For the two cases above, I think you could consider using
>>>>>>>> association rules to tackle the problem.
>>>>>>>>
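The formula above restricts both the dot product and the norms to the entries that exist in both vectors. A minimal Python sketch of that variant follows. It is not the Mahout implementation: the dict-based vectors, the function name cosine_matching_entries, and the sample values are assumptions made for illustration.

```python
import math

def cosine_matching_entries(a, b):
    """Cosine computed only over indices present in both sparse vectors."""
    common = a.keys() & b.keys()
    if not common:
        return 0.0
    num = sum(a[i] * b[i] for i in common)
    den = (math.sqrt(sum(a[i] ** 2 for i in common))
           * math.sqrt(sum(b[i] ** 2 for i in common)))
    return num / den if den else 0.0

# Same layout as itemA and itemB above, with made-up values:
# itemA has entries at positions 2, 3, 4 and itemB at positions 1, 2, 3.
item_a = {2: 1.0, 3: 2.0, 4: 2.0}
item_b = {1: 2.0, 2: 1.0, 3: 2.0}
print(cosine_matching_entries(item_a, item_b))  # close to 1.0

# Case 1: a single common word always gives 1.0 for positive weights,
# whatever the rest of each vector looks like.
one_a = {2: 3.0, 9: 1.0}
one_b = {2: 0.5, 7: 4.0}
print(cosine_matching_entries(one_a, one_b))  # 1.0
```

With a single shared index the numerator and denominator are the same product of two numbers, which is why case 1 yields 1 regardless of the other entries.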
>>>>>>>> At 2012-10-01 20:53:16,yamo93 <yamo93@gmail.com> wrote:
>>>>>>>>> It seems that RowSimilarityJob does not have the same weakness, but I
>>>>>>>>> also use CosineSimilarity. Why?
>>>>>>>>>
>>>>>>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>>>>>>> Yes, this is one of the weaknesses of this particular flavor of this
>>>>>>>>>> particular similarity metric. The more sparse, the worse the problem
>>>>>>>>>> is in general. There are some band-aid solutions like applying some
>>>>>>>>>> kind of weight against similarities based on small intersection size.
>>>>>>>>>> Or you can pretend that missing values are 0 (PreferenceInferrer),
>>>>>>>>>> which can introduce its own problems, or perhaps some mean value.
>>>>>>>>>>
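One possible form of the "weight against small intersection size" band-aid mentioned above is to shrink a similarity toward 0 when it rests on few shared entries. The Python sketch below is an assumption for illustration only, not something the thread or Mahout prescribes: the function name damp_by_overlap and the shrinkage constant of 10 are hypothetical.

```python
def damp_by_overlap(similarity, overlap_size, shrinkage=10.0):
    """Scale a similarity by n / (n + shrinkage), where n is the number of
    entries the two vectors have in common.

    Small overlaps are pulled strongly toward 0; large overlaps are left
    almost unchanged. The shrinkage value is a tuning knob, not a standard.
    """
    if overlap_size <= 0:
        return 0.0
    return similarity * overlap_size / (overlap_size + shrinkage)

# A "perfect" 1.0 based on a single shared entry is heavily discounted,
# while the same score based on 50 shared entries stays high.
print(damp_by_overlap(1.0, 1))   # 1/11
print(damp_by_overlap(1.0, 50))  # 50/60
```

The right amount of damping depends on how sparse the data is, so the constant would need to be tuned against real recommendations.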
>>>>>>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <yamo93@gmail.com> wrote:
>>>>>>>>>>> Thanks for replying.
>>>>>>>>>>>
>>>>>>>>>>> So, documents with only one word in common are more likely to be
>>>>>>>>>>> rated similar than documents with more words in common, right?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>>>>>>> Similar items, right? You should look at the vectors that have 1.0
>>>>>>>>>>>> similarity and see if they are in fact collinear. This is still by far
>>>>>>>>>>>> the most likely explanation. Remember that the vector similarity is
>>>>>>>>>>>> computed over elements that exist in both vectors only. They just have
>>>>>>>>>>>> to have 2 identical values for this to happen.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <yamo93@gmail.com> wrote:
>>>>>>>>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>>>>>>>>> It sounds like a bug somewhere.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>>>>>>>>>> occurs when two vectors are just a scalar multiple of each other (0
>>>>>>>>>>>>>> angle between them). It's possible there are several of these, and so
>>>>>>>>>>>>>> their 1.0 similarities dominate the result.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <yamo93@gmail.com> wrote:
>>>>>>>>>>>>>>> I saw something strange: all recommended items returned by
>>>>>>>>>>>>>>> mostSimilarItems() have a value of 1.0.
>>>>>>>>>>>>>>> Is this normal?
>>
>
