I assume one or both users have all the same ratings, at least on the
overlapping items. That makes the standard deviation of those ratings zero,
and since the formula divides by it, the correlation is undefined. I think
the answer is: that's just how the measure is defined.
This tends to happen when the users have very little overlap, say 1-2
items, and ignoring such a pair rather than assigning it a similarity is
generally a good thing.
But yes, this is a reason you might not choose this metric.
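
To make that concrete, here is a minimal standalone sketch. It is not the
Mahout code: the class name PearsonSketch and the method pearson are made up
for illustration, and it assumes the two arrays already hold only the
overlapping items, in the same order. It centers the ratings itself and then
applies the same denominator check as the snippet quoted below. Note that
identical ratings only give NaN when they are also constant; identical
ratings that vary come out at (approximately) 1.

```java
public class PearsonSketch {

    // Pearson correlation over two equal-length arrays of ratings for the
    // overlapping items. Returns NaN when either side has zero variance.
    public static double pearson(double[] x, double[] y) {
        int n = x.length;

        // Compute the means so the data can be centered.
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Accumulate sums over the centered values.
        double sumXY = 0.0;
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sumXY += dx * dy;
            sumX2 += dx * dx;
            sumY2 += dy * dy;
        }

        double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
        if (denominator == 0.0) {
            // At least one user gave the same rating to every shared item.
            return Double.NaN;
        }
        return sumXY / denominator;
    }

    public static void main(String[] args) {
        // Both users gave 4 stars to both shared items: zero variance.
        double constant = pearson(new double[] {4, 4}, new double[] {4, 4});
        System.out.println("constant ratings: " + constant);

        // Identical ratings that vary across the shared items.
        double varying = pearson(new double[] {3, 5}, new double[] {3, 5});
        System.out.println("varying ratings:  " + varying);

        // A caller can test for the undefined case and skip the pair.
        System.out.println("skip pair? " + Double.isNaN(constant));
    }
}
```

A caller that wants to ignore such pairs can check the result with
Double.isNaN, as the last line of main does.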
On Thu, Jun 2, 2011 at 4:00 AM, Jason Smith wrote:
> What is the reasoning behind PearsonCorrelationSimilarity returning
> NaN for userSimilarity when the two users' overlapping reviews match
> up perfectly?
> In my case of a limited set of rating values (1 to 5 stars) it seems
> quite possible that a user with a smaller number of ratings might have
> overlapping ratings with other users. Am I missing something here?
>
> // Note that sum of X and sum of Y don't appear here since they are
> // assumed to be 0; the data is assumed to be centered.
> double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
> if (denominator == 0.0) {
>   // One or both parties has -all- the same ratings;
>   // can't really say much similarity under this measure
>   return Double.NaN;
> }
> return sumXY / denominator;
>