Hey Sebastian,
Thanks again. Actually, I'm glad that I am talking to you, as it's your paper
and presentation I have questions about! :)
So to clarify my question further, looking at this presentation (
http://isabeldrost.de/hadoop/slides/collabMahout.pdf) you have the
following user x item matrix:
    M  A  I
A   5  1  4
B   -  2  5
P   4  3  2
If I want to calculate the Pearson correlation between Matrix and
Inception, I'd have the rating vectors:
[5 - 4] vs [4 5 2].
One of the steps in your paper is the normalization step, which subtracts
the mean item rating from each value and then essentially takes the L2 norm
of the resulting vector (in other words, the L2 norm of the mean-centered
vector?).
The question I have is: what is the average rating for Matrix and
Inception? I can see the following options:
1. Matrix: 4.5 (9/2), Inception: 3 (6/2), because you only consider shared
   ratings.
2. Matrix: 3 (9/3), Inception: 3.667 (11/3), assuming that the missing
   rating is 0.
3. Matrix: 4.5 (9/2), Inception: 3.667 (11/3), subtracting the average of
   all non-zero ratings ==> This is what I believe the current
   implementation does.
Unfortunately, none of these yields the 0.47 listed in the presentation,
but that's a separate issue. In my testing, I see that Mahout Taste
(non-distributed) uses the 1st approach while the distributed approach uses
the 3rd approach.
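
For concreteness, the three options can be sketched in plain Python as
follows. This is only an illustration, not Mahout code: the function names
are made up, and for option 3 it assumes that the missing rating stays at 0
after mean-centering.

```python
import math

# Ratings by users A, B, P; None marks the rating that B never gave.
matrix = [5, None, 4]
inception = [4, 5, 2]


def cosine(x, y):
    """Dot product divided by the product of the L2 norms."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


def mean(values):
    return sum(values) / len(values)


def pearson_corated(x, y):
    """Option 1: keep only users who rated both items, then center."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    return cosine([a - mean(xs) for a in xs], [b - mean(ys) for b in ys])


def pearson_missing_as_zero(x, y):
    """Option 2: treat a missing rating as a real rating of 0."""
    xs = [0 if a is None else a for a in x]
    ys = [0 if b is None else b for b in y]
    return cosine([a - mean(xs) for a in xs], [b - mean(ys) for b in ys])


def pearson_mean_of_rated(x, y):
    """Option 3: center each item on the mean of its own existing ratings.

    Assumption: a missing entry stays at 0 after centering.
    """
    def center(v):
        m = mean([a for a in v if a is not None])
        return [0.0 if a is None else a - m for a in v]

    return cosine(center(x), center(y))


for f in (pearson_corated, pearson_missing_as_zero, pearson_mean_of_rated):
    print(f.__name__, round(f(matrix, inception), 4))
```

On the example data this gives roughly 1.0, -0.62 and 0.65 for options 1, 2
and 3 respectively.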
I am okay with #3; however, I just want to confirm that this is the case
and that it's okay. This is why I was asking about Pearson correlation
between vectors of "different" lengths: the average rating is being
computed using a denominator (number of users) that is different between
the two (2 vs 3).
I know you said that in practice people don't use Pearson to compute
inferred ratings, but this is just for my complete understanding (and since
it's the example used in your presentation). This same question applies to
cosine, as you are taking the L2 norm of the vector as a preprocessing step
and including/excluding non-shared ratings may make a difference.
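
As a small illustration of the cosine case (again just a sketch with
made-up function names, not Mahout code), here is the same pair of items
with the non-shared rating either dropped or kept as a zero:

```python
import math

# Ratings by users A, B, P; None marks the rating that B never gave.
matrix = [5, None, 4]
inception = [4, 5, 2]


def cosine(x, y):
    """Dot product divided by the product of the L2 norms."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


def cosine_corated(x, y):
    """Use only the users who rated both items."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return cosine([a for a, _ in pairs], [b for _, b in pairs])


def cosine_zero_filled(x, y):
    """Keep every user and treat a missing rating as 0."""
    xs = [0 if a is None else a for a in x]
    ys = [0 if b is None else b for b in y]
    return cosine(xs, ys)


print(round(cosine_corated(matrix, inception), 4))
print(round(cosine_zero_filled(matrix, inception), 4))
```

The dot product is 28 in both cases; only the L2 norm of the Inception
vector changes (it includes B's rating of 5 in the second case), which
lowers the similarity from about 0.98 to about 0.65.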
Thanks again!
Amit
On Wed, Nov 27, 2013 at 7:13 AM, Sebastian Schelter <ssc.open@googlemail.com> wrote:
> Hi Amit,
>
> Yes, it gives different results. However, in practice, most people don't
> do rating prediction with the Pearson coefficient, but use count-based
> measures like the log-likelihood ratio test.
>
> The distributed code doesn't look at vectors of different lengths, but
> simply assumes nonexistent ratings as zero.
>
> sebastian
>
> On 27.11.2013 16:09, Amit Nithian wrote:
> > Comparing this against the non-distributed (Taste) gives different answers
> > for item-item similarity, as of course the non-distributed looks only at
> > co-rated items. I was more wondering if this difference in practice
> > mattered or not.
> >
> > Also, I'm confused about how you can compute the Pearson similarity
> > between two vectors of different length, which is essentially what is
> > going on here, I think?
> >
> > Thanks again
> > Amit
> > On Nov 27, 2013 9:06 AM, "Sebastian Schelter" <ssc.open@googlemail.com>
> > wrote:
> >
> >> Yes, it is due to the parallel algorithm, which only looks at co-ratings
> >> from a given user.
> >>
> >>
> >> On 27.11.2013 15:02, Amit Nithian wrote:
> >>> Thanks Sebastian! Is there a particular reason for that?
> >>> On Nov 27, 2013 7:47 AM, "Sebastian Schelter" <ssc.open@googlemail.com>
> >>> wrote:
> >>>
> >>>> Hi Amit,
> >>>>
> >>>> You are right, the non-co-rated items are not filtered out in the
> >>>> distributed implementation.
> >>>>
> >>>> sebastian
> >>>>
> >>>>
> >>>> On 26.11.2013 20:51, Amit Nithian wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> Apologies if this is a repeat question as I just joined the list, but I
> >>>>> have a question about the way that metrics like Cosine and Pearson are
> >>>>> calculated in Hadoop "mode" (i.e. non-Taste).
> >>>>>
> >>>>> As far as I understand, the vectors used for computing pairwise item
> >>>>> similarity in Taste are based on the co-rated items; however, in the
> >>>>> Hadoop implementation, I don't see this done.
> >>>>>
> >>>>> The implementation of the distributed item-item similarity comes from
> >>>>> this paper: http://ssc.io/wpcontent/uploads/2012/06/rec11schelter.pdf.
> >>>>> I didn't see anything in this paper about filtering out those elements
> >>>>> from the vectors that are not co-rated, and this can make a difference,
> >>>>> especially when you normalize the ratings by dividing by the average
> >>>>> item rating. In some cases, the number of users to divide by can be
> >>>>> fewer, depending on the sparseness of the vector.
> >>>>>
> >>>>> Any clarity on this would be helpful.
> >>>>>
> >>>>> Thanks!
> >>>>> Amit
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >
>
>
