Thanks guys! The real question is not so much what the average of the
vector with the missing rating is (although yes, that was a question) but
what the average of the fully specified vector is, given that its second
rating is not shared with the first user:
[5 4] vs [4 5 2].
If we agree that the first is 4.5, is the second one 11/3 or 3
((4+2)/2)? Taste computes it as (4+2)/2, while the distributed mode
computes it as 11/3.
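
To make the two candidate means concrete, here is a minimal sketch. The
user ids u1..u3 are hypothetical, and the alignment (the first vector is
missing the middle rating) is my assumption, inferred from the (4+2)/2
figure. This is not code from Taste or from the Hadoop job.

```python
# Sparse item-rating vectors keyed by (hypothetical) user id.
# A missing rating is an absent key, not a 0.
x = {"u1": 5.0, "u3": 4.0}             # [5 4], no rating from u2
y = {"u1": 4.0, "u2": 5.0, "u3": 2.0}  # [4 5 2]


def mean_all(v):
    """Mean over every rating present in the vector."""
    return sum(v.values()) / len(v)


def mean_corated(v, other):
    """Mean over only the ratings whose user also rated the other item."""
    shared = [v[user] for user in v if user in other]
    return sum(shared) / len(shared)


print("x, all ratings:     ", mean_all(x))
print("y, all ratings:     ", mean_all(y))
print("y, co-rated with x: ", mean_corated(y, x))
```

The first vector gives 4.5 either way, since both of its ratings are
co-rated; only the second vector's mean depends on the choice.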
Since Taste (and LensKit) is sequential, it can (and does) look only at
co-occurring ratings, whereas the Hadoop implementation doesn't. The paper
that Sebastian wrote has a preprocessing step where (for Pearson) you
subtract the average rating from each element of an item-rating vector,
which implies that each item-rating vector is treated independently of the
others, whereas in the sequential/non-distributed mode the vectors are
considered together.
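
Here is a rough sketch of the two behaviours as I understand them, using
the same example vectors. The function names are hypothetical, and
treating the distributed variant as "center each vector by its own mean,
then take the cosine" is my reading of the preprocessing step, not a
confirmed description of the Hadoop code.

```python
import math

# Same example as above; user ids are hypothetical.
x = {"u1": 5.0, "u3": 4.0}
y = {"u1": 4.0, "u2": 5.0, "u3": 2.0}


def pearson_corated(a, b):
    """Sequential style: means, sums and norms use co-rated entries only."""
    shared = [k for k in a if k in b]
    mean_a = sum(a[k] for k in shared) / len(shared)
    mean_b = sum(b[k] for k in shared) / len(shared)
    num = sum((a[k] - mean_a) * (b[k] - mean_b) for k in shared)
    den_a = math.sqrt(sum((a[k] - mean_a) ** 2 for k in shared))
    den_b = math.sqrt(sum((b[k] - mean_b) ** 2 for k in shared))
    return num / (den_a * den_b)  # no guard for a zero denominator


def center(v):
    """Preprocessing: subtract the vector's own mean from each rating."""
    mean = sum(v.values()) / len(v)
    return {k: r - mean for k, r in v.items()}


def pearson_independent(a, b):
    """Assumed distributed style: center each vector alone, then cosine.

    The dot product only touches co-rated entries, but each norm runs
    over the whole vector.
    """
    ca, cb = center(a), center(b)
    dot = sum(ca[k] * cb[k] for k in ca if k in cb)
    norm_a = math.sqrt(sum(r * r for r in ca.values()))
    norm_b = math.sqrt(sum(r * r for r in cb.values()))
    return dot / (norm_a * norm_b)  # no guard for a zero denominator


print("co-rated only:", pearson_corated(x, y))
print("independent:  ", pearson_independent(x, y))
```

If this reading is right, the two variants give different similarities
for the example, because the rating that is not co-rated still shifts the
second vector's mean and norm.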
My main reason for posting is that the Taste implementation of item-item
similarity differs from the distributed implementation. Since I am totally
new to this space and these similarities, I wanted to understand whether
there is a reason for this difference and whether or not it matters. From
the discussion it sounds like it doesn't matter, but understanding why
helps me explain this to others.
My guess (and I'm glad Sebastian is on this list so he can help
confirm/deny this; sorry, I'm not picking on you, just happy to be able to
talk to you about your good paper) is that considering co-occurring ratings
in a distributed implementation would require access to the full matrix,
which defeats the parallel nature of computing item-item similarity?
Thanks again!
Amit
On Sun, Dec 1, 2013 at 2:55 AM, Sean Owen <srowen@gmail.com> wrote:
> It's not an issue of how to be careful with sparsity and subtracting
> means, although that's a valuable point in itself. The question is
> what the mean is supposed to be.
>
> You can't think of missing ratings as 0 in general, and the example
> here shows why: you're acting as if most movies are hated. Instead
> they are excluded from the computation entirely.
>
> m_x should be 4.5 in the example here. That's consistent with
> literature and the other implementations earlier in this project.
>
> I don't know the Hadoop implementation well enough, and wasn't sure
> from the comments above, whether it does end up behaving as if it's
> "4.5" or "3". If it's not 4.5 I would call that a bug. Items that
> aren't corated can't meaningfully be included in this computation.
>
>
> On Sun, Dec 1, 2013 at 8:29 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > Good point Amit.
> >
> > Not sure how much this matters. It may be that
> > PearsonCorrelationSimilarity is a bad name that should be
> > PearsonInspiredCorrelationSimilarity. My guess is that this
> > implementation is lifted directly from the very early recommendation
> > literature and is reflective of the way that it was used back then.
>
