mahout-user mailing list archives

From Johannes Schulte <johannes.schu...@gmail.com>
Subject Re: Mix of Content Based and Collaborative Filtering
Date Wed, 07 Nov 2012 08:49:56 GMT
Indeed I haven't looked at your "all time classic blog entry" (
http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html) in a
while, and now there's a discussion about exactly this in the comments. The
same question came to my mind there: why does Mahout's
LogLikelihoodSimilarity treat the LLR scores as similarities then? I just
looked into the 0.8-SNAPSHOT, and the log-likelihood scores are simply
summed up (which is the same as what we're doing, by the way...)
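
(For reference, the 2x2 contingency table computation from the blog post
fits in a few lines. This is only a sketch from memory, along the lines of
Mahout's org.apache.mahout.math.stats.LogLikelihood, not the actual class:)

// Sketch of the 2x2 LLR from the blog post.
// k11 = A and B together, k12 = A without B,
// k21 = B without A, k22 = neither A nor B.
public final class Llr {

  public static double logLikelihoodRatio(long k11, long k12,
                                          long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // LLR = 2 * N * mutual information; clamp round-off error to zero
    return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
  }

  // Count-weighted Shannon entropy: N * H(k/N)
  private static double entropy(long... counts) {
    long sum = 0;
    double xLogXSum = 0.0;
    for (long k : counts) {
      sum += k;
      xLogXSum += xLogX(k);
    }
    return xLogX(sum) - xLogXSum;
  }

  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }
}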

Getting back to the original intention of the post, I think it might be
helpful to distinguish what types of "signals" can be used in such a
system, meaning which "relations" it can incorporate:

true user features:
- items the user interacts with
- search terms
- profile information (region, location, etc.)

derived user features (in the item space):
- take the interactions with items and construct features from the item
content (e.g. with tf/idf or even LLR)
(The question about the "MoreLikeThis" feature points in this direction)

All of those features can be expressed as "a associates with b" and
combined for a single request. Some of the user features might also be
transferable to the item space (like search terms or region) and be used
as "real" search terms.
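
(To make the "a associates with b" idea concrete, here is a minimal sketch
with made-up names; every signal type is reduced to the same pair
representation, so a single counting/LLR pipeline can serve all of them:)

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: flatten heterogeneous signals into uniform
// "a associates with b" pairs and count cooccurrences, so item clicks,
// search terms and profile attributes all feed the same LLR machinery.
public class AssociationCounter {

  private final Map<String, Map<String, Long>> counts =
      new HashMap<String, Map<String, Long>>();

  // Record one association, e.g. observe("user:42", "item:1337")
  // or observe("item:1337", "term:jazz").
  public void observe(String a, String b) {
    Map<String, Long> row = counts.get(a);
    if (row == null) {
      row = new HashMap<String, Long>();
      counts.put(a, row);
    }
    Long c = row.get(b);
    row.put(b, c == null ? 1L : c + 1L);
  }

  public long count(String a, String b) {
    Map<String, Long> row = counts.get(a);
    if (row == null) {
      return 0L;
    }
    Long c = row.get(b);
    return c == null ? 0L : c;
  }
}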

This was intended to make things clearer; I'm not sure it does :|.

I think the most important thing to note is that this "cross-feature
association rule" style is very rarely mentioned or considered when
recommender systems are discussed in the classic way (rating, item,
similarity...), despite its usefulness and simplicity.


On Wed, Nov 7, 2012 at 12:42 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> We recently helped a client do this and actually got higher relevance than
> scores that had been "fought for".  That doesn't mean your scores will fare
> similarly, but I really think the benefit of getting content signals into
> the same framework outweighed any cleverness in the handling of the
> traditional collaborative signals.
>
> LLR scores should essentially never be used as weights, but rather should
> only be used as filters.  I had several examples of this in my dissertation,
> the most notable being the way that document routing using LLR as a filter
> worked better than the learned scores for the same task.  Ironically, the
> LLR filter ran on the document retrieval version of the routing system that
> it beat.  Whether there would have been a better way to use LLR as a filter
> and compute sophisticated weights for the surviving terms isn't a question
> I asked, and lots of water has flowed under the bridge since then.
>
> There have been lots of replications of the LLR-is-better-for-filtering
> result over the years and, as far as I know, no refutations.
>
> On Tue, Nov 6, 2012 at 12:42 PM, Johannes Schulte <johannes.schulte@gmail.com> wrote:
>
> > Maybe I'll try throwing away the scores we fought so hard for.
> > You're right, mixing vector space model scores and LLR is questionable
> > without more sophisticated methods.
> > Thanks for the answers!
> >
> > On Tue, Nov 6, 2012 at 5:44 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >
> > > On Mon, Nov 5, 2012 at 9:16 PM, Johannes Schulte <johannes.schulte@gmail.com> wrote:
> > >
> > > >
> > > > Is it possible you are mixing up payloads and stored fields? The latter
> > > > are not indexed and can only be used for the top n results. Maybe we're
> > > > talking about different things...
> > > >
> > >
> > > I think I did mix these up.  I haven't been active with Lucene for some
> > > time.
> > >
> > >
> > > > With the question of how to include the similarities, I was actually
> > > > asking for a way to include the scores, say an LLR value, in an index.
> > > > Do you just take the top x related items and throw the similarity
> > > > score away?
> > > >
> > >
> > > LLR is not a good score for weighting.  It is an excellent score for
> > > filtering.  So yes, I just take the top few hundred related items and
> > > throw away the similarity score.
> > >
> > > Sebastian has demonstrated that trimming the related objects this way
> > > has no perceptible effect, but if you have content relations as well,
> > > you get even more assurance that you will get some kind of reasonable
> > > recommendations.
> > >
> > >
> > > > As for the performance: yes, sorry, that was a little bragging and
> > > > not really informative :).
> > > >
> > >
> > > Very informative, actually.  The performance is what made it clear
> > > that I was confused.
> > >
> >
>
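
(To make the "top few hundred, throw the score away" advice from above
concrete for the index side, here is a rough sketch, all names
hypothetical; the surviving ids would then be indexed as an ordinary
analyzed text field, so the search engine's own term weighting takes over
at query time:)

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the "LLR as filter" step: rank candidate items
// by LLR, keep the top few hundred, and deliberately discard the scores.
public class LlrFilter {

  public static class Candidate {
    final String itemId;
    final double llr;
    Candidate(String itemId, double llr) {
      this.itemId = itemId;
      this.llr = llr;
    }
  }

  // Returns just the ids of the top-n candidates by LLR; the score
  // itself is thrown away, only the survivors get indexed.
  public static List<String> topRelatedIds(List<Candidate> candidates, int n) {
    List<Candidate> sorted = new ArrayList<Candidate>(candidates);
    Collections.sort(sorted, new Comparator<Candidate>() {
      @Override
      public int compare(Candidate a, Candidate b) {
        return Double.compare(b.llr, a.llr); // descending by LLR
      }
    });
    List<String> ids = new ArrayList<String>();
    for (Candidate c : sorted.subList(0, Math.min(n, sorted.size()))) {
      ids.add(c.itemId);
    }
    return ids;
  }
}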
