mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Question about Pearson Correlation in non-Taste mode
Date Sat, 07 Dec 2013 00:55:42 GMT
See

http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf
http://arxiv.org/abs/1207.1847
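
For a concrete feel for what the test computes, here is a minimal,
self-contained Java sketch of the log-likelihood ratio (G^2) score for a
2x2 contingency table of co-occurrence counts, following the entropy
formulation in the first link above; the class name and example counts are
illustrative only (Mahout ships its own version in
org.apache.mahout.math.stats.LogLikelihood):

    public final class LlrSketch {

      private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
      }

      // Unnormalized Shannon entropy of a set of counts, i.e. N * H, in nats.
      private static double entropy(long... counts) {
        long sum = 0;
        double sumOfXLogX = 0.0;
        for (long count : counts) {
          sum += count;
          sumOfXLogX += xLogX(count);
        }
        return xLogX(sum) - sumOfXLogX;
      }

      // k11 = both events occurred, k12 = only A, k21 = only B, k22 = neither.
      public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // G^2 = 2 * (N*H(row sums) + N*H(column sums) - N*H(cells)); a large
        // score means the co-occurrence is unlikely to be pure chance.
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
      }

      public static void main(String[] args) {
        // e.g. 13 users touched both items, 1000 each touched only one of
        // them, 100000 touched neither.
        System.out.println(logLikelihoodRatio(13, 1000, 1000, 100000));
      }
    }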

On Fri, Dec 6, 2013 at 1:09 PM, Amit Nithian <anithian@gmail.com> wrote:

> Hey Sebastian,
>
> Thanks again for the explanation. So now you have me intrigued about
> something else. Why is it that the log-likelihood ratio test is a better
> measure for essentially implicit ratings? Are there resources/research
> papers you can point me to explaining this?
>
> Take care
> Amit
>
>
> On Sun, Dec 1, 2013 at 9:25 AM, Sebastian Schelter
> <ssc.open@googlemail.com>wrote:
>
> > Hi Amit,
> >
> > No need to apologize for picking on me, I'm happy about anyone digging
> > into the paper :)
> >
> > The reason I implemented Pearson in this (flawed) way has to do with
> > the way the parallel algorithm works:
> >
> > It never compares two item vectors in memory; instead, it preprocesses
> > the vectors and computes sparse dot products in parallel. The centering
> > that is usually done for Pearson correlation depends on which pair
> > of vectors you're currently looking at (and doesn't fit the parallel
> > algorithm). We had an earlier implementation that didn't have this flaw,
> > but it was way slower than the current one.
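> >
> > As a rough sketch of what that preprocessing amounts to (my reading of
> > the scheme described above, not the actual Mahout code; the names and
> > Map-based sparse vectors are illustrative): each item vector is centered
> > by its own mean over its observed ratings, independently of any other
> > vector, after which "Pearson" similarity reduces to a sparse dot product
> > that parallelizes easily:
> >
> >     import java.util.HashMap;
> >     import java.util.Map;
> >
> >     final class DistributedPearsonSketch {
> >
> >       // Center an item vector (user id -> rating) by its OWN mean over
> >       // the ratings it has. This per-vector step is what fits the
> >       // parallel algorithm, and also the source of the "flaw": the mean
> >       // is not restricted to co-rated users.
> >       static Map<Long, Double> center(Map<Long, Double> item) {
> >         double mean = item.values().stream()
> >             .mapToDouble(Double::doubleValue).average().orElse(0.0);
> >         Map<Long, Double> centered = new HashMap<>();
> >         item.forEach((user, r) -> centered.put(user, r - mean));
> >         return centered;
> >       }
> >
> >       // After centering, the similarity is a sparse dot product over the
> >       // users the two vectors share, divided by the full vector norms.
> >       static double similarity(Map<Long, Double> a, Map<Long, Double> b) {
> >         double dot = 0.0;
> >         for (Map.Entry<Long, Double> e : a.entrySet()) {
> >           Double other = b.get(e.getKey());
> >           if (other != null) {
> >             dot += e.getValue() * other;
> >           }
> >         }
> >         double normA = Math.sqrt(a.values().stream()
> >             .mapToDouble(v -> v * v).sum());
> >         double normB = Math.sqrt(b.values().stream()
> >             .mapToDouble(v -> v * v).sum());
> >         return dot / (normA * normB);
> >       }
> >     }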
> >
> > Rating prediction on explicit feedback data (like ratings), for which
> > Pearson correlation is mostly used in CF, is a rather academic topic,
> > and in research there are nearly no datasets that really require you
> > to go to Hadoop.
> >
> > On the other hand, item prediction on implicit feedback data (like
> > clicks) is the common scenario in the majority of industry use cases,
> > but here count-based similarity measures like the log-likelihood ratio
> > test give much better results. The current implementation of Mahout's
> > distributed item-based recommender is clearly designed and tuned for
> > the latter use case.
> >
> > I hope that answers your question.
> >
> > --sebastian
> >
> > On 01.12.2013 18:10, Amit Nithian wrote:
> > > Thanks guys! So the real question is not so much what the average of
> > > the vector with the missing rating is (although yes, that was a
> > > question), but what the average is of the vector with all its ratings
> > > specified, whose second rating is not shared with the first user:
> > > [5 - 4] vs [4 5 2].
> > >
> > > If we agree that the first is 4.5, then is the second one 11/3 or 3
> > > ((4+2)/2)? Taste has this as ((4+2)/2) while distributed mode has it
> > > as 11/3.
> > >
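> > > A quick way to see the two conventions side by side (illustrative
> > > code, not taken from either implementation; NaN just marks the
> > > missing rating):
> > >
> > >     public class MeanConventions {
> > >       public static void main(String[] args) {
> > >         double[] a = {5, Double.NaN, 4};  // vector with a missing rating
> > >         double[] b = {4, 5, 2};           // fully specified vector
> > >
> > >         // Mean of a over its observed ratings: (5 + 4) / 2 = 4.5
> > >         double aSum = 0;
> > >         int aCount = 0;
> > >         for (double r : a) {
> > >           if (!Double.isNaN(r)) {
> > >             aSum += r;
> > >             aCount++;
> > >           }
> > >         }
> > >         System.out.println(aSum / aCount);             // 4.5
> > >
> > >         double coRatedSum = 0;
> > >         int coRatedCount = 0;
> > >         double fullSum = 0;
> > >         for (int i = 0; i < b.length; i++) {
> > >           fullSum += b[i];
> > >           if (!Double.isNaN(a[i])) {  // only positions a also rated
> > >             coRatedSum += b[i];
> > >             coRatedCount++;
> > >           }
> > >         }
> > >         System.out.println(coRatedSum / coRatedCount); // 3.0: Taste
> > >         System.out.println(fullSum / b.length);        // 3.67: distributed
> > >       }
> > >     }
> > >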
> > > Since Taste (and Lenskit) is sequential, it can (and will) only look
> > > at co-occurring ratings, whereas the Hadoop implementation doesn't.
> > > The paper that Sebastian wrote has a pre-processing step where (for
> > > Pearson) you subtract the average rating from each element of an
> > > item-rating vector, which implies that each item-rating vector is
> > > treated independently of the others, whereas in the
> > > sequential/non-distributed mode they are all considered together.
> > >
> > > My main reason for posting is that the Taste implementation of
> > > item-item similarity differs from the distributed implementation.
> > > Since I am totally new to this space and these similarities, I wanted
> > > to understand whether there is a reason for this difference and
> > > whether or not it matters. It sounds from the discussion like it
> > > doesn't matter, but understanding why helps me explain this to others.
> > >
> > > My guess (and I'm glad Sebastian is on this list so he can help
> > > confirm/deny this... sorry, I'm not picking on you, just happy to be
> > > able to talk to you about your good paper) is that considering
> > > co-occurring ratings in a distributed implementation would require
> > > access to the full matrix, which defeats the parallel nature of
> > > computing item-item similarity?
> > >
> > > Thanks again!
> > > Amit
> > >
> > >
> > > On Sun, Dec 1, 2013 at 2:55 AM, Sean Owen <srowen@gmail.com> wrote:
> > >
> > >> It's not an issue of how to be careful with sparsity and subtracting
> > >> means, although that's a valuable point in itself. The question is
> > >> what the mean is supposed to be.
> > >>
> > >> You can't think of missing ratings as 0 in general, and the example
> > >> here shows why: you're acting as if most movies are hated. Instead
> > >> they are excluded from the computation entirely.
> > >>
> > >> m_x should be 4.5 in the example here. That's consistent with the
> > >> literature and with the other implementations earlier in this
> > >> project.
> > >>
> > >> I don't know the Hadoop implementation well enough, and wasn't sure
> > >> from the comments above, whether it ends up behaving as if it's
> > >> "4.5" or "3". If it's not 4.5, I would call that a bug. Items that
> > >> aren't co-rated can't meaningfully be included in this computation.
> > >>
> > >>
> > >> On Sun, Dec 1, 2013 at 8:29 AM, Ted Dunning <ted.dunning@gmail.com>
> > >> wrote:
> > >>> Good point Amit.
> > >>>
> > >>> Not sure how much this matters. It may be that
> > >>> PearsonCorrelationSimilarity is a bad name that should be
> > >>> PearsonInspiredCorrelationSimilarity. My guess is that this
> > >>> implementation is lifted directly from the very early recommendation
> > >>> literature and is reflective of the way it was used back then.
> > >>
> > >
> >
> >
>
