Okay, so I rethought my question and realized that the paper never really
talked about collaborative filtering, but just about how to calculate
item-item similarity in a scalable fashion. Perhaps this is the reason why
the common ratings aren't used? Because that's not a prerequisite for this
calculation?
Although for my own clarity, I'd still like to get a better understanding
of what it means to calculate the correlation between sparse vectors where
you're normalizing each vector using a separate denominator.
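
To make the "separate denominator" point concrete, here is a minimal sketch in plain Python (not Mahout code) of the three averaging options from the quoted message below, applied to the Matrix and Inception ratings from the slides. The function names are made up for illustration, and option 3 only reflects my reading of what the distributed code does, so treat it as an assumption rather than a description of the actual implementation.

```python
from math import sqrt

# Ratings by users A, B, P (in that order); None marks a missing rating.
matrix = [5, None, 4]
inception = [4, 5, 2]


def cosine(x, y):
    """Cosine of two equal-length numeric lists (assumes non-zero norms)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


def center(values):
    """Subtract the mean of the given values from each of them."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def zero_fill(values):
    """Replace missing ratings with 0."""
    return [0 if v is None else v for v in values]


def pearson_corated(x, y):
    """Option 1: keep only users who rated both items, then center."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return cosine(center([a for a, _ in pairs]), center([b for _, b in pairs]))


def pearson_zero_filled(x, y):
    """Option 2: treat a missing rating as 0, then center over all users."""
    return cosine(center(zero_fill(x)), center(zero_fill(y)))


def pearson_own_mean(x, y):
    """Option 3 (assumed): center each item on the mean of its own observed
    ratings; missing entries stay 0 and add nothing to the dot product."""
    def center_observed(values):
        seen = [v for v in values if v is not None]
        mean = sum(seen) / len(seen)
        return [0.0 if v is None else v - mean for v in values]
    return cosine(center_observed(x), center_observed(y))


for f in (pearson_corated, pearson_zero_filled, pearson_own_mean):
    print(f.__name__, round(f(matrix, inception), 3))
```

On this data the three options come out at roughly 1.0, -0.62 and 0.65, so none of them reproduces the 0.47 from the slides. In option 3 each item is centered and normalized over its own set of raters (2 users for Matrix, 3 for Inception), which is exactly the "separate denominator" situation I am asking about.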
P.S. If my question(s) don't make sense, please let me know, as it's very
possible that I am completely misunderstanding something :).
Thanks again!
Amit
On Wed, Nov 27, 2013 at 8:23 AM, Amit Nithian <anithian@gmail.com> wrote:
> Hey Sebastian,
>
> Thanks again. Actually, I'm glad that I am talking to you, as it's your
> paper and presentation I have questions about! :)
>
> So to clarify my question further, looking at this presentation (
> http://isabeldrost.de/hadoop/slides/collabMahout.pdf) you have the
> following user x item matrix:
>      M  A  I
>   A  5  1  4
>   B  -  2  5
>   P  4  3  2
>
> If I want to calculate the Pearson correlation between Matrix and
> Inception, I'd have the rating vectors:
> [5 - 4] vs [4 5 2].
>
> One of the steps in your paper is the normalization step, which subtracts
> the mean item rating from each value and then essentially takes the L2 norm
> of the resulting vector (in other words, the L2 norm of the mean-centered
> vector?).
>
> The question I have is: what is the average rating for Matrix and
> Inception? I can see the following options:
> 1. Matrix = 4.5 (9/2), Inception = 3 (6/2), because you only consider
>    shared ratings.
> 2. Matrix = 3 (9/3), Inception = 3.667 (11/3), assuming that the missing
>    rating is 0.
> 3. Matrix = 4.5 (9/2), Inception = 3.667 (11/3), where each item's mean is
>    the average of all of its non-zero ratings. This is what I believe the
>    current implementation does.
>
> Unfortunately, none of these yields the 0.47 listed in the presentation,
> but that's a separate issue. In my testing, I see that Mahout Taste
> (non-distributed) uses the 1st approach while the distributed approach uses
> the 3rd approach.
>
> I am okay with #3; however, I just want to confirm that this is the case
> and that it's okay. This is why I was asking about Pearson correlation
> between vectors of "different" lengths: the average rating is being
> computed using a denominator (number of users) that differs between
> the two (2 vs 3).
>
> I know you said in practice that people don't use Pearson to compute
> inferred ratings but this is just for my complete understanding (and since
> it's the example used in your presentation). This same question applies to
> cosine, as you are taking an L2 norm of the vector as a preprocessing step
> and including/excluding non-shared ratings may make a difference.
>
> Thanks again!
> Amit
>
>
> On Wed, Nov 27, 2013 at 7:13 AM, Sebastian Schelter <
> ssc.open@googlemail.com> wrote:
>
>> Hi Amit,
>>
>> Yes, it gives different results. However, in practice, most people don't
>> do rating prediction with the Pearson coefficient, but use count-based
>> measures like the log-likelihood ratio test.
>>
>> The distributed code doesn't look at vectors of different lengths, but
>> simply treats non-existent ratings as zero.
>>
>> sebastian
>>
>> On 27.11.2013 16:09, Amit Nithian wrote:
>> > Comparing this against the non-distributed (Taste) implementation gives
>> > different answers for item-item similarity, as of course the
>> > non-distributed one looks only at co-rated items. I was more wondering
>> > whether this difference matters in practice or not.
>> >
>> > Also, I'm confused about how you can compute the Pearson similarity
>> > between two vectors of different lengths, which I think is essentially
>> > what is going on here?
>> >
>> > Thanks again
>> > Amit
>> > On Nov 27, 2013 9:06 AM, "Sebastian Schelter" <ssc.open@googlemail.com>
>> > wrote:
>> >
>> >> Yes, it is due to the parallel algorithm, which only looks at co-ratings
>> >> from a given user.
>> >>
>> >>
>> >> On 27.11.2013 15:02, Amit Nithian wrote:
>> >>> Thanks Sebastian! Is there a particular reason for that?
>> >>> On Nov 27, 2013 7:47 AM, "Sebastian Schelter" <ssc.open@googlemail.com>
>> >>> wrote:
>> >>>
>> >>>> Hi Amit,
>> >>>>
>> >>>> You are right, the non-co-rated items are not filtered out in the
>> >>>> distributed implementation.
>> >>>>
>> >>>> sebastian
>> >>>>
>> >>>>
>> >>>> On 26.11.2013 20:51, Amit Nithian wrote:
>> >>>>> Hi all,
>> >>>>>
>> >>>>> Apologies if this is a repeat question, as I just joined the list,
>> >>>>> but I have a question about the way that metrics like Cosine and
>> >>>>> Pearson are calculated in Hadoop "mode" (i.e. non-Taste).
>> >>>>>
>> >>>>> As far as I understand, the vectors used for computing pairwise
>> >>>>> item similarity in Taste are based on the co-rated items; however,
>> >>>>> in the Hadoop implementation, I don't see this done.
>> >>>>>
>> >>>>> The implementation of the distributed item-item similarity comes
>> >>>>> from this paper:
>> >>>>> http://ssc.io/wpcontent/uploads/2012/06/rec11schelter.pdf.
>> >>>>> I didn't see anything in this paper about filtering out those
>> >>>>> elements of the vectors that are not co-rated, and this can make a
>> >>>>> difference, especially when you normalize the ratings by dividing by
>> >>>>> the average item rating. In some cases, the number of users to
>> >>>>> divide by can be fewer, depending on the sparseness of the vector.
>> >>>>>
>> >>>>> Any clarity on this would be helpful.
>> >>>>>
>> >>>>> Thanks!
>> >>>>> Amit
>> >>>>>
>> >>>>
>> >>>>
>> >>>
>> >>
>> >>
>> >
>>
>>
>
