mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Recommending when working with binary data sets
Date Tue, 30 Sep 2008 18:55:10 GMT
For the sake of learning I'll ignore that patent (and am aware of Doug's stand on this and
like it)

One thing that's not clear to me in that Amazon item-item recommendation article is what exactly
happens at the moment when recommendation is needed.  They "explain" that part in a single
sentence: "Given a similar-items table, the algorithm  finds items similar to each of the
user's purchases and ratings,  aggregates those items, and then recommends the most popular
or  correlated items."

If I understood their item-item similarity computation, this is what first happens off-line
(example):
* catalogue has items {1, 2, 3 ...100}
* say there is only 1 user named Joe who made any purchases and over time he bought items
{1, 10, 20}

I think their algo then translates to:

for item i1 in {1, 2, 3 ...100} {
  for item i2 in {1, 10, 20} {
    computeSimilarity(i1, i2)
  }
}

sim (1, 1) = 1.0
sim (1, 10) = 0.7  // let's say
sim (1, 20) = 0.8  // let's say
sim (10, 1) = same as sim (1,10)
sim (10, 10) = 1
sim (10, 20) = 0.9 // let's say
sim (20, 1) = same as sim (1,20)
sim (20, 10) = same as sim(10,20)
sim (20, 20) = 1

So, then we have a list of most similarity items for each item:
item 1: 20, 10
item 10: 20, 1
item 20: 10, 1

Is this interpretation correct?  If it is, I don't quite yet see the benefit of computing
similarity between items for a single person.  Of course they will be similar, unless we want
to create groups of items for a single person with a diverse set of interests for some reason.

Assuming the above interpretatoin is correct, what happens when Joe comes to a site and the
recommendation engine needs to find something to recommend to him?  Say that since Joe's last
visit the catalogue expanded and now also has items 101-200 and the RE needs to figure out
N best items to recommend from the set of new items (101-200).
What does RE do at that point?

All that's said in the article is:
"Rather than matching the user to similar customers,  item-to-item collaborative filtering
matches each of the user's  purchased and rated items to similar items, then combines those
 similar items into a recommendation list."
"Given a similar-items table, the algorithm  finds items similar to
each of the user's purchases and ratings,  aggregates those items, and
then recommends the most popular or  correlated items."

Wouldn't one need to build item-item similarity matrix for all Joe's items (1, 10, 20) and
all new items (101-200)?
If so, what's the point of figuring item-item similarity for all items Joe already consumed?

Thanks,
Otis


----- Original Message ----
> From: Otis Gospodnetic <otis_gospodnetic@yahoo.com>
> To: mahout-user@lucene.apache.org
> Sent: Tuesday, September 30, 2008 1:09:08 PM
> Subject: Re: Recommending when working with binary data sets
> 
> Hello,
> 
> Thanks for the pointers, Grant.  Regarding that Amazon item-item 
> recommendation.  It looks like that's patented:
> 
> http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=2&p=1&f=G&l=50&d=PTXT&S1=amazon.ASNM.&OS=AN/(amazon)&RS=AN/amazon
> 
> Does that mean one cannot implement this in Taste (or any other piece of 
> software)?  Even if used in non-shopping purposes?
> 
> 
> Thanks,
> Otis
> 
> 
> ----- Original Message ----
> > From: Grant Ingersoll 
> > To: mahout-user@lucene.apache.org
> > Sent: Monday, September 29, 2008 9:43:58 AM
> > Subject: Re: Recommending when working with binary data sets
> > 
> > Not sure I know the answer in terms of Taste, but did a little bit of  
> > digging (mind you, I'm no CF expert, but I'm learning thanks to Taste  
> > and Sean).
> > 
> > At any rate, came across:
> > Started at Wikipedia's page: 
> > http://en.wikipedia.org/wiki/Collaborative_filtering
> > 
> > which lead to http://en.wikipedia.org/wiki/Slope_One, which then has  
> > an interesting comment about Amazon's item-item approach, which, via  
> > Google Scholar leads to:
> > 
> > 
> http://dsonline.computer.org/portal/site/dsonline/menuitem.9ed3d9924aeb0dcd82ccc6716bbe36ec/index.jsp?&pName=dso_level1&path=dsonline/2003_Archives/0301/d&file=wp1lind.xml&xsl=article.xsl&;jsessionid=LghY1grHgYJpBTLpWjX5NtvQwhH1Bkv9rpfXT4VnpVtDNVpfZ8n0!-1404507079
> > 
> > In particular, see the "How it Works" section.  Essentially, it  
> > describes how they build the item to item similarity matrix, which I  
> > believe is also what you need to do.
> > 
> > HTH,
> > Grant
> > 
> > On Sep 26, 2008, at 1:52 PM, Otis Gospodnetic wrote:
> > 
> > > Hi,
> > >
> > > I've been reading the chapter on recommendations in Programming  
> > > Collective Intelligence and looking at Taste.  The examples in PCI  
> > > all assume scenarios where items to recommend have been rated by  
> > > users on some scale.  I understand how items can be recommended to  
> > > users using item-based filtering and user-item ratings and why this  
> > > is preferred over user-based filtering when the number of users is  
> > > larger than the number of items.
> > > But what if all I've got is item-item similarity (content-based) and  
> > > there are no user-item ratings?  Say I have a situation where people  
> > > simply either consume content (e.g. read an article, watch a  
> > > video...) or not consume it (don't read an article, don't watch the  
> > > video...).  In other words, I really have only yes/no or 1/0 or seen/ 
> > > not seen type "rating".
> > >
> > > I can't really use Euclidean distance or Pearson correlation  
> > > coefficient, can I?
> > >
> > > What do people use in such scenarios?  Would it make sense to use 
> > http://en.wikipedia.org/wiki/Jaccard_index 
> > >  for such cases?
> > > ... Ah, I do see javadoc in TanimotoCoefficientSimilarity saying  
> > > exactly that, good.
> > >
> > > But then my question is:
> > > Doesn't the use of Jaccard/Tanimoto mean going back to the expensive  
> > > user-user similarity computation?
> > >
> > > That is, if I need to recommend items for user U1 don't I need to:
> > > 1) have user-user similarity pre-computed (and recomputed  
> > > periodically)
> > > 2) find top N users U{2,3,4,...N} who are the most similar to U1
> > > 3) then for these top N most similar users find their "seen" items  
> > > that U1 has not seen (possibly limit this to only recently seen items)
> > > 4) select top N items from 3) and recommend those to U1.
> > >
> > > If so, then 1) is again expensive.
> > > And what how would one go about selecting top N items from the list  
> > > in this case other than ordering them by user-user similarity?
> > >
> > > Of course, something is telling me I'm demonstrating that I don't  
> > > yet have the full grasp of item-based filtering.  I hope that's the  
> > > case! :)
> > >
> > > Thanks,
> > > Otis
> > 
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com
> > 
> > Lucene Helpful Hints:
> > http://wiki.apache.org/lucene-java/BasicsOfPerformance
> > http://wiki.apache.org/lucene-java/LuceneFAQ


Mime
View raw message