Good point, Amit.
Not sure how much this matters. It may be that
PearsonCorrelationSimilarity is a bad name that should be
PearsonInspiredCorrelationSimilarity. My guess is that this implementation
is lifted directly from the very early recommendation literature and
reflects the way that it was used back then.
Remember that the context here is prediction of ratings. If you assume
that you really want correlation and that missing elements are zero, then
this is mathematically wrong. On the other hand, if you assume missing
elements are equal to the mean (whatever it is), then this definition is
correct.
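
As a rough illustration of the two assumptions, here is a small sketch using
the X and Y from the example quoted below. This is not Mahout code: the class
and method names are made up for the example, 0.0 stands in for an element
that is not stored, and the correlation is the plain textbook formula.

```java
// Illustrative sketch only, not Mahout code. A 0.0 entry means "not stored".
final class SparseMeanSketch {

    // Mean over every position: missing entries count as real zeros.
    static double meanOverAll(double[] v) {
        double sum = 0.0;
        for (double value : v) {
            sum += value;
        }
        return sum / v.length;
    }

    // Mean over stored (nonzero) entries only.
    static double meanOverStored(double[] v) {
        double sum = 0.0;
        int stored = 0;
        for (double value : v) {
            if (value != 0.0) {
                sum += value;
                stored++;
            }
        }
        return stored == 0 ? 0.0 : sum / stored;
    }

    // Copy of v with each missing entry replaced by the mean of the stored ones.
    static double[] fillMissingWithMean(double[] v) {
        double mean = meanOverStored(v);
        double[] filled = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            filled[i] = (v[i] == 0.0) ? mean : v[i];
        }
        return filled;
    }

    // Textbook Pearson correlation of two equal-length dense arrays.
    // Returns NaN if either array has zero variance.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double mx = meanOverAll(x);
        double my = meanOverAll(y);
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - mx;
            double dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] x = {5, 0, 4}; // middle element is not stored
        double[] y = {4, 5, 2};
        System.out.println("mean of X, missing as zero: " + meanOverAll(x));
        System.out.println("mean of X, stored only:     " + meanOverStored(x));
        System.out.println("Pearson, missing as zero:   " + pearson(x, y));
        System.out.println("Pearson, missing as mean:   "
                + pearson(fillMissingWithMean(x), fillMissingWithMean(y)));
    }
}
```

With these inputs the two readings disagree even in sign: roughly -0.62 when
the missing element is a zero, roughly 0.65 when it is filled with the mean.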
In any case, I don't think that PearsonCorrelationSimilarity should be
"fixed" at this point. First of all, a substantial change here is somewhat
risky since there may be people who depend on current behavior. Second, I
think that this is almost never a particularly good recommendation
algorithm so even if the proposed change is a small improvement, it will
have negligible positive effect on the universe of production recommenders.
Remember that this function is not a stats routine. It is an embodiment of
recommendation practice. Were it the former, I would strongly recommend we
fix it.
On Sat, Nov 30, 2013 at 10:18 AM, Amit Nithian <anithian@gmail.com> wrote:
> Hi Ted,
>
> Thanks that is what I would have thought too but I don't think that the
> Pearson Similarity (in Hadoop mode) does this:
>
> in
>
> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.PearsonCorrelationSimilarity
> around line 31
>
> double average = vector.norm(1) / vector.getNumNonZeroElements();
> Which looks like it's taking the sum and dividing by the number of defined
> elements. Which would make my [5 - 4] average be 4.5.
>
> Thanks again
> Amit
>
> On Fri, Nov 29, 2013 at 10:34 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > On Fri, Nov 29, 2013 at 10:16 PM, Amit Nithian <anithian@gmail.com>
> wrote:
> >
> > > Hi Ted,
> > >
> > > Thanks for your response. I thought that the mean of a sparse vector is
> > > simply the mean of the "defined" elements? Why would the vectors become
> > > dense unless you're meaning that all the undefined elements (0?) now will
> > > be (0 - m_x)?
> > >
> >
> > Yes. Just so. All those zero elements become nonzero and the vector is
> > thus no longer sparse.
> >
> >
> > >
> > > Looking at the following example:
> > > X = [5 - 4] and Y = [4 5 2].
> > >
> > > is m_x 4.5 or 3?
> >
> >
> > 3.
> >
> > This is because the elements of X are really 5, 0, and 4. The zero is
> just
> > not stored, but it still is the value of that element.
> >
> >
> > > Is m_y 11/3 or (6/2) because we ignore the "5" since its
> > > counterpart in X is undefined?
> > >
> >
> > 11/3
> >
>
