Thanks guys!

So the real question is not so much what the average of the vector with the missing rating is (although yes, that was a question), but what the average of the fully specified vector is when its second rating is not shared with the first user: [5 - 4] vs [4 5 2]. If we agree that the first is 4.5, then is the second one 11/3 ((4+5+2)/3) or 3 ((4+2)/2)? Taste has this as (4+2)/2, while distributed mode has it as 11/3.

Since Taste (and Lenskit) is sequential, it can (and will only) look at co-occurring ratings, whereas the Hadoop implementation doesn't. The paper that Sebastian wrote has a pre-processing step where (for Pearson) you subtract the average rating of an item-rating vector from each of its elements, which implies that each item-rating vector is treated independently of the others, whereas in the sequential/non-distributed mode they are all considered together.

My main reason for posting is that the Taste implementation of item-item similarity differs from the distributed implementation. Since I am totally new to this space and these similarities, I wanted to understand whether there is a reason for this difference and whether or not it matters. It sounds from the discussion like it doesn't matter, but understanding why helps me explain this to others.

My guess (and I'm glad Sebastian is on this list so he can help confirm/deny this... sorry, I'm not picking on you, just happy to be able to talk to you about your good paper) is that considering only co-occurring ratings in a distributed implementation would require access to the full matrix, which defeats the parallel nature of computing item-item similarity?

Thanks again!
Amit

On Sun, Dec 1, 2013 at 2:55 AM, Sean Owen wrote:
> It's not an issue of how to be careful with sparsity and subtracting
> means, although that's a valuable point in itself. The question is
> what the mean is supposed to be.
>
> You can't think of missing ratings as 0 in general, and the example
> here shows why: you're acting as if most movies are hated. Instead
> they are excluded from the computation entirely.
>
> m_x should be 4.5 in the example here. That's consistent with
> literature and the other implementations earlier in this project.
>
> I don't know the Hadoop implementation well enough, and wasn't sure
> from the comments above, whether it does end up behaving as if it's
> "4.5" or "3". If it's not 4.5 I would call that a bug. Items that
> aren't co-rated can't meaningfully be included in this computation.
>
>
> On Sun, Dec 1, 2013 at 8:29 AM, Ted Dunning wrote:
> > Good point Amit.
> >
> > Not sure how much this matters. It may be that
> > PearsonCorrelationSimilarity is a bad name that should be
> > PearsonInspiredCorrelationSimilarity. My guess is that this
> > implementation is lifted directly from the very early recommendation
> > literature and is reflective of the way that it was used back then.
>
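
P.S. To make the two candidate averages concrete, here is a minimal Python sketch of the arithmetic for the example vectors [5 - 4] and [4 5 2]. It is not Mahout code: the helper names mean_all and mean_corated are made up for illustration, and None stands in for the missing rating. It only reproduces the 4.5, 11/3 and 3 discussed in this thread.

```python
# Illustration only: hypothetical helpers, not part of Taste or the
# Hadoop job. None marks a missing rating.
a = [5, None, 4]   # the vector written as [5 - 4]
b = [4, 5, 2]      # the fully specified vector


def mean_all(v):
    """Mean over every rating present in v, ignoring the other vector."""
    rated = [r for r in v if r is not None]
    return sum(rated) / len(rated)


def mean_corated(v, other):
    """Mean over the positions of v where both v and other have a rating."""
    shared = [r for r, o in zip(v, other) if r is not None and o is not None]
    return sum(shared) / len(shared)


print(mean_all(a))         # 4.5, the same under either convention
print(mean_all(b))         # 11/3, each vector treated independently
print(mean_corated(b, a))  # 3.0, only co-rated positions counted
```

The first vector gives 4.5 either way because its missing entry is simply skipped. The two conventions only disagree on the second vector, where the 5 has no counterpart in the first.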