mahout-user mailing list archives

From Sebastian Schelter
Subject Re: Question about Pearson Correlation in non-Taste mode
Date Sun, 01 Dec 2013 17:25:24 GMT
Hi Amit,

No need to apologize for picking on me; I'm happy about anyone digging into
the paper :)

The reason I implemented Pearson in this (flawed) way has to do with
the way the parallel algorithm works:

It never compares two item vectors in memory; instead, it preprocesses
each vector on its own and computes sparse dot products in parallel. The
centering that is usually done for Pearson correlation depends on which
pair of vectors you're currently looking at, so it doesn't fit the
parallel algorithm. We had an earlier implementation that didn't have
this flaw, but it was way slower than the current one.
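
To make the difference concrete, here is a minimal Python sketch (not the
Mahout code) using your example vectors [5 - 4] and [4 5 2] from the quoted
message below. The function names, the dict-based sparse vectors, and the
choice to normalize by each vector's full norm in the precentered variant
are assumptions made for illustration only.

```python
import math


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pearson_corated(x, y):
    """Pairwise Pearson: center and normalize over co-rated entries only."""
    shared = sorted(x.keys() & y.keys())
    if not shared:
        return 0.0
    mx = mean(x[u] for u in shared)
    my = mean(y[u] for u in shared)
    dx = [x[u] - mx for u in shared]
    dy = [y[u] - my for u in shared]
    dot = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))
    return dot / norm if norm else 0.0


def precenter(v):
    """Center one vector by the mean of all of its own ratings."""
    m = mean(v.values())
    return {u: r - m for u, r in v.items()}


def pearson_precentered(x, y):
    """Each vector is centered independently, then a sparse dot product is taken.

    Assumption: norms are computed over all of each vector's ratings.
    """
    cx, cy = precenter(x), precenter(y)
    dot = sum(cx[u] * cy[u] for u in cx.keys() & cy.keys())
    norm = math.sqrt(sum(a * a for a in cx.values())) * math.sqrt(
        sum(b * b for b in cy.values())
    )
    return dot / norm if norm else 0.0


# Keys are positions in the vectors; missing ratings are simply absent.
x = {0: 5, 2: 4}        # [5 - 4], mean 4.5 either way
y = {0: 4, 1: 5, 2: 2}  # [4 5 2], mean 3 over co-rated entries, 11/3 over all

corated = pearson_corated(x, y)          # about 1.0
precentered = pearson_precentered(x, y)  # about 0.65
```

The precentered variant needs only per-vector statistics, which is what lets
the preprocessing run independently for every vector; the pairwise variant
needs both vectors side by side before it can even compute the means.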

Rating prediction on explicit feedback data such as ratings, which is where
Pearson correlation is mostly used in CF, is a rather academic topic, and
in science there are nearly no datasets that really require you to go to

On the other hand, item prediction on implicit feedback data (like
clicks) is the common scenario in the majority of industry use cases, and
there count-based similarity measures like the log-likelihood ratio test
give much better results. The current implementation of Mahout's
distributed item-based recommender is clearly designed and tuned for the
latter use case.
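
As an illustration of what a count-based measure looks like, here is a
minimal Python sketch of the log-likelihood ratio (G-test) statistic on a
2x2 table of co-occurrence counts. It is not taken from the Mahout source;
the function names and the meaning assigned to k11..k22 are assumptions for
this example.

```python
import math


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """G-test statistic for a 2x2 contingency table.

    Assumed layout for two items A and B:
      k11: users who interacted with both A and B
      k12: users who interacted with A but not B
      k21: users who interacted with B but not A
      k22: users who interacted with neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


independent = log_likelihood_ratio(10, 10, 10, 10)  # no association: about 0
associated = log_likelihood_ratio(10, 0, 0, 10)     # strong association: 40 * ln(2)
```

Only counts go in, so no rating values and no centering are involved, which
is why a measure of this kind suits click-style data.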

I hope that answers your question.


On 01.12.2013 18:10, Amit Nithian wrote:
> Thanks guys! So the real question is not so much what's the average of the
> vector with the missing rating (although yes that was a question) but
> what's the average of the vector with all the ratings specified but the
> second rating that is not shared with the first user:
> [5 - 4] vs [4 5 2].
> If we agree that the first is 4.5 then is the second one 11/3 or 3
> ((4+2)/2)? Taste has this as ((4+2)/2) while distributed mode has it as
> 11/3.
> Since Taste (and Lenskit) is sequential, it can (and will only) look at
> co-occurring ratings whereas the Hadoop implementation doesn't. The paper
> that Sebastian wrote has a pre-processing step where (for Pearson) you
> subtract each element of an item-rating vector from the average rating
> which implies that each item-rating vector is treated independently of each
> other whereas in the sequential/non-distributed mode it's all considered
> together.
> My main reason for posting is because the Taste implementation of item-item
> similarity differs from the distributed implementation. Since I am totally
> new to this space and these similarities I wanted to understand if there is
> a reason for this difference and whether or not it matters. Sounds like
> from the discussion it doesn't matter but understanding why helps me
> explain this to others.
> My guess (and I'm glad Sebastian is on this list so he can help
> confirm/deny this.. sorry I'm not picking on you just happy to be able to
> talk to you about your good paper) is that considering co-occurring ratings
> in a distributed implementation would require access to the full matrix
> which defeats the parallel nature of computing item-item similarity?
> Thanks again!
> Amit
> On Sun, Dec 1, 2013 at 2:55 AM, Sean Owen wrote:
>> It's not an issue of how to be careful with sparsity and subtracting
>> means, although that's a valuable point in itself. The question is
>> what the mean is supposed to be.
>> You can't think of missing ratings as 0 in general, and the example
>> here shows why: you're acting as if most movies are hated. Instead
>> they are excluded from the computation entirely.
>> m_x should be 4.5 in the example here. That's consistent with
>> literature and the other implementations earlier in this project.
>> I don't know the Hadoop implementation well enough, and wasn't sure
>> from the comments above, whether it does end up behaving as if it's
>> "4.5" or "3". If it's not 4.5 I would call that a bug. Items that
>> aren't co-rated can't meaningfully be included in this computation.
>> On Sun, Dec 1, 2013 at 8:29 AM, Ted Dunning wrote:
>>> Good point Amit.
>>> Not sure how much this matters.  It may be that
>>> PearsonCorrelationSimilarity is a bad name that should be
>>> PearsonInspiredCorrelationSimilarity.  My guess is that this implementation
>>> is lifted directly from the very early recommendation literature and is
>>> reflective of the way that it was used back then.
