mahout-user mailing list archives

From Amit Nithian <anith...@gmail.com>
Subject Re: Question about Pearson Correlation in non-Taste mode
Date Wed, 27 Nov 2013 16:23:45 GMT
Hey Sebastian,

Thanks again. Actually I'm glad that I am talking to you as it's your paper
and presentation I have questions about! :-)

So to clarify my question further, looking at this presentation (
http://isabel-drost.de/hadoop/slides/collabMahout.pdf) you have the
following user x item matrix:
     M    A    I
A    5    1    4
B    -    2    5
P    4    3    2

If I want to calculate the Pearson correlation between Matrix (M) and
Inception (I), I'd have the rating vectors:
[5 - 4] vs [4 5 2].

One of the steps in your paper is the normalization step, which subtracts
the mean item rating from each value and then essentially takes the L2 norm
of the resulting vector (or in other words, the L2 norm of the mean-centered
vector?)

The question I have had is: what is the average rating for Matrix and
Inception? I can see the following options:
1. Matrix - 4.5 (9/2), Inception - 3 (6/2), because you only consider
   shared ratings.
2. Matrix - 3 (9/3), Inception - 3.667 (11/3), assuming that the missing
   rating is 0.
3. Matrix - 4.5 (9/2), Inception - 3.667 (11/3), subtracting the average of
   all non-zero ratings for each item ==> This is what I believe the
   current implementation does.

Unfortunately, none of these yields the 0.47 listed in the presentation,
but that's a separate issue. In my testing, I see that Mahout Taste
(non-distributed) uses the 1st approach while the distributed approach uses
the 3rd approach.
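
To make the three options concrete, here is a minimal Python sketch of the
arithmetic for the Matrix and Inception columns. It is not Mahout code. The
function names are made up for illustration, and for the 3rd approach I am
assuming that a missing rating stays at zero after centering (i.e. only the
stored, non-zero entries have the mean subtracted).

```python
import math

# Ratings for users A, B, P; None marks a missing rating.
matrix = [5, None, 4]
inception = [4, 5, 2]

def cosine(x, y):
    """Cosine of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson_corated(x, y):
    """Approach 1: keep only users who rated both items, then center."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return cosine([a - mean_x for a in xs], [b - mean_y for b in ys])

def pearson_missing_as_zero(x, y):
    """Approach 2: treat a missing rating as 0 and center over all users."""
    xs = [0.0 if a is None else a for a in x]
    ys = [0.0 if b is None else b for b in y]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return cosine([a - mean_x for a in xs], [b - mean_y for b in ys])

def pearson_sparse_centered(x, y):
    """Approach 3 (assumption): center only the ratings that exist, using the
    mean of the existing ratings; missing entries stay at 0."""
    def center_rated(v):
        rated = [a for a in v if a is not None]
        mean = sum(rated) / len(rated)
        return [0.0 if a is None else a - mean for a in v]
    return cosine(center_rated(x), center_rated(y))

print(pearson_corated(matrix, inception))          # roughly 1.0
print(pearson_missing_as_zero(matrix, inception))  # roughly -0.62
print(pearson_sparse_centered(matrix, inception))  # roughly 0.65
```

Under these assumptions the three approaches give noticeably different
values for the same pair of items, and none of them is 0.47.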

I am okay with #3; however, I just want to confirm that this is the case
and that it's okay. This is why I was asking about Pearson correlation
between vectors of "different" lengths: the average rating is being
computed using a denominator (number of users) that differs between
the two items (2 vs 3).

I know you said in practice that people don't use Pearson to compute
inferred ratings but this is just for my complete understanding (and since
it's the example used in your presentation). This same question applies to
cosine as you are doing an L2-Norm of the vector as a pre-processing step
and including/excluding non-shared ratings may make a difference.
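
As a quick illustration of the cosine case, here is a small Python sketch
(again not Mahout code, with hypothetical function names) comparing cosine
over shared ratings only against cosine with the missing rating treated as
zero, for the same Matrix and Inception columns.

```python
import math

# Ratings for users A, B, P; None marks a missing rating.
matrix = [5, None, 4]
inception = [4, 5, 2]

def cosine(x, y):
    """Cosine of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def cosine_corated(x, y):
    """Use only the users who rated both items."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return cosine([a for a, _ in pairs], [b for _, b in pairs])

def cosine_missing_as_zero(x, y):
    """Keep all users and treat a missing rating as 0."""
    xs = [0.0 if a is None else a for a in x]
    ys = [0.0 if b is None else b for b in y]
    return cosine(xs, ys)

print(cosine_corated(matrix, inception))          # roughly 0.98
print(cosine_missing_as_zero(matrix, inception))  # roughly 0.65
```

The dot product is the same in both cases (the missing entry contributes
nothing), so the whole difference comes from Inception's L2 norm including
or excluding the rating from user B.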

Thanks again!
Amit


On Wed, Nov 27, 2013 at 7:13 AM, Sebastian Schelter
<ssc.open@googlemail.com> wrote:

> Hi Amit,
>
> Yes, it gives different results. However in practice, most people don't
> do rating prediction with Pearson coefficient, but use count-based
> measures like the loglikelihood ratio test.
>
> The distributed code doesn't look at vectors of different lengths, but
> simply assumes non-existent ratings as zero.
>
> --sebastian
>
> On 27.11.2013 16:09, Amit Nithian wrote:
> > Comparing this against the non distributed (taste) gives different answers
> > for item item similarity as of course the non distributed looks only at
> > corated items. I was more wondering if this difference in practice mattered
> > or not.
> >
> > Also I'm confused on how you can compute the Pearson similarity between two
> > vectors of different length which essentially is going on here I think?
> >
> > Thanks again
> > Amit
> > On Nov 27, 2013 9:06 AM, "Sebastian Schelter" <ssc.open@googlemail.com>
> > wrote:
> >
> >> Yes, it is due to the parallel algorithm which only looks at co-ratings
> >> from a given user.
> >>
> >>
> >> On 27.11.2013 15:02, Amit Nithian wrote:
> >>> Thanks Sebastian! Is there a particular reason for that?
> >>> On Nov 27, 2013 7:47 AM, "Sebastian Schelter" <ssc.open@googlemail.com>
> >>> wrote:
> >>>
> >>>> Hi Amit,
> >>>>
> >>>> You are right, the non-corated items are not filtered out in the
> >>>> distributed implementation.
> >>>>
> >>>> --sebastian
> >>>>
> >>>>
> >>>> On 26.11.2013 20:51, Amit Nithian wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> Apologies if this is a repeat question as I just joined the list but I
> >>>>> have a question about the way that metrics like Cosine and Pearson are
> >>>>> calculated in Hadoop "mode" (i.e. non Taste).
> >>>>>
> >>>>> As far as I understand, the vectors used for computing pairwise item
> >>>>> similarity in Taste are based on the co-rated items; however, in the
> >>>>> Hadoop implementation, I don't see this done.
> >>>>>
> >>>>> The implementation of the distributed item-item similarity comes from
> >>>>> this paper http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf.
> >>>>> I didn't see anything in this paper about filtering out those elements
> >>>>> from the vectors not co-rated and this can make a difference especially
> >>>>> when you normalize the ratings by dividing by the average item rating.
> >>>>> In some cases, the # users to divide by can be fewer depending on the
> >>>>> sparseness of the vector.
> >>>>>
> >>>>> Any clarity on this would be helpful.
> >>>>>
> >>>>> Thanks!
> >>>>> Amit
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >
>
>
