mahout-user mailing list archives

From Grant Ingersoll
Subject [Taste] Sanity Check and Questions
Date Thu, 18 Jun 2009 15:29:54 GMT
I'm working on a Mahout demo, part of which covers collaborative
filtering (CF).  For the CF part, I'm following an idea from Ted for
demonstrating how CF works conceptually.  (Ted, please correct me if
my understanding is incorrect.)

I took a subset of Wikipedia articles (2302, available at

, created by the WikipediaXMLSplitter in the example directory).   
Next, I picked a topic of interest, in this case all docs containing  
the phrase "Abraham Lincoln", and I made the assumption that there are  
10 users out of a total of 1000 who are "Lincolnphiles" and have  
thereby rated most of the articles (17 total) on the topic.  The
ratings range from -5 to 5 (as doubles).  For the most part, the
Lincolnphiles like the same things, though to varying degrees.
(Note: I did these ratings by hand and thus "stacked the deck.")
The Lincolnphiles are really obsessed and did not rate any other
documents.  However, not all of them rated all 17 articles.
Next, I assumed the other 990 users rate randomly across all the
documents, in the same range.  Thus, for every article in the set, I
randomly picked X users and had each of them assign a random degree
of like or dislike in that range.
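
The random half of this setup could be generated along the following
lines. This is only a sketch under assumptions: the class and method
names are hypothetical, "X" is taken to be 5 raters per article purely
for illustration, and the background users are given ids 10..999 so
that ids 0..9 stay free for the hand-rated Lincolnphiles.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class SyntheticRatings {

  /** One (user, item, rating) triple. */
  public static final class Rating {
    public final int userId;
    public final int itemId;
    public final double value;

    Rating(int userId, int itemId, double value) {
      this.userId = userId;
      this.itemId = itemId;
      this.value = value;
    }
  }

  /**
   * For every item, picks ratersPerItem distinct users from
   * [firstUserId, firstUserId + numUsers) and gives each of them a
   * uniformly random rating in [-5, 5).
   */
  public static List<Rating> randomRatings(int firstUserId, int numUsers,
                                           int numItems, int ratersPerItem,
                                           long seed) {
    if (ratersPerItem > numUsers) {
      throw new IllegalArgumentException("more raters per item than users");
    }
    Random random = new Random(seed);
    List<Rating> ratings = new ArrayList<>();
    for (int itemId = 0; itemId < numItems; itemId++) {
      // Draw user ids until we have enough distinct raters for this item.
      Set<Integer> chosen = new HashSet<>();
      while (chosen.size() < ratersPerItem) {
        chosen.add(firstUserId + random.nextInt(numUsers));
      }
      for (int userId : chosen) {
        double value = -5.0 + 10.0 * random.nextDouble();
        ratings.add(new Rating(userId, itemId, value));
      }
    }
    return ratings;
  }

  public static void main(String[] args) {
    // 990 background users, 2302 articles, 5 raters per article (assumed).
    List<Rating> ratings = randomRatings(10, 990, 2302, 5, 42L);
    System.out.println("Generated " + ratings.size() + " ratings");
  }
}
```

With a fixed seed the data set is reproducible, which makes it easier
to compare recommendations across different neighborhood sizes.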

I then implemented a basic recommender following the "User-based
recommenders" section of the Taste docs, and passed in the user id of
one of the Lincolnphiles.  The results I get back are a bit
surprising: none of the recommendations are for other items rated
highly by the Lincolnphiles, even though, with the neighborhood size
set to 10, the neighborhood contains all of the other Lincolnphiles
plus one non-Lincolnphile.  I would expect recommendations for items
that my Lincolnphile has not rated but that at least some of the
other Lincolnphiles have, yet none of the recommendations are for
Lincoln docs.

OK, so I then played around a bit with the neighborhood size.  If I
make it 9 (the number of other Lincolnphiles in the system) or less,
I get what I expected.  So, it seems the one non-Lincolnphile rated a
lot more items than all the Lincolnphiles.  Is that why that user's
items seem to dominate the recommendations?  Comparing the
non-Lincoln user with my Lincolnphile, I see two items that they both
rated: one that they both really liked and one that they disagreed
on.
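
One way a single outsider could dominate is sketched below. It rests
on assumptions and is not taken from the Taste source. Suppose
similarity is Pearson correlation over co-rated items, and suppose an
item's estimated preference is the similarity-weighted average of the
ratings from only those neighbors who rated it. With just two co-rated
items, Pearson correlation can only be +1 or -1 (or undefined), so the
outsider can look like a perfect match. Any item that only the
outsider rated then inherits that single rating without being averaged
down. The class name, ratings, and similarity values here are made up
for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class NeighborDominance {

  /** Pearson correlation over the items both users rated; NaN if undefined. */
  public static double pearson(Map<String, Double> a, Map<String, Double> b) {
    double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
    int n = 0;
    for (Map.Entry<String, Double> e : a.entrySet()) {
      Double other = b.get(e.getKey());
      if (other == null) {
        continue; // not co-rated
      }
      double x = e.getValue();
      double y = other;
      sumA += x;
      sumB += y;
      sumAA += x * x;
      sumBB += y * y;
      sumAB += x * y;
      n++;
    }
    if (n == 0) {
      return Double.NaN;
    }
    double cov = sumAB - sumA * sumB / n;
    double varA = sumAA - sumA * sumA / n;
    double varB = sumBB - sumB * sumB / n;
    double denom = Math.sqrt(varA * varB);
    return denom == 0.0 ? Double.NaN : cov / denom;
  }

  /**
   * Similarity-weighted average of the neighbors' ratings for one item.
   * A NaN rating means "this neighbor did not rate the item".
   */
  public static double estimate(double[] neighborRatings, double[] similarities) {
    double weighted = 0;
    double total = 0;
    for (int i = 0; i < neighborRatings.length; i++) {
      if (Double.isNaN(neighborRatings[i])) {
        continue;
      }
      weighted += similarities[i] * neighborRatings[i];
      total += similarities[i];
    }
    return total == 0.0 ? Double.NaN : weighted / total;
  }

  public static void main(String[] args) {
    // Two co-rated items: both liked "x", but they disagree on "y".
    Map<String, Double> lincolnphile = new HashMap<>();
    lincolnphile.put("x", 5.0);
    lincolnphile.put("y", 4.0);
    Map<String, Double> outsider = new HashMap<>();
    outsider.put("x", 5.0);
    outsider.put("y", -3.0);
    System.out.println("similarity = " + pearson(lincolnphile, outsider));

    // Neighbors: three Lincolnphiles, then the outsider (made-up similarities).
    double[] sims = {0.9, 0.8, 0.9, 1.0};
    double[] lincolnDoc = {5.0, 3.0, 4.0, Double.NaN};
    double[] outsiderDoc = {Double.NaN, Double.NaN, Double.NaN, 5.0};
    System.out.println("Lincoln doc estimate  = " + estimate(lincolnDoc, sims));
    System.out.println("outsider doc estimate = " + estimate(outsiderDoc, sims));
  }
}
```

Under these assumptions, every item the outsider alone rated highly
outranks Lincoln docs whose estimates are pulled down by the
Lincolnphiles' "varying degrees" of liking. If this is what is
happening, requiring a minimum number of co-rated items or of
contributing neighbors would be one thing to try.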

I'm not exactly sure what my questions are, other than the one about
an active user dominating like-minded but less active raters, and
what the appropriate thing to do there is, if anything.  Mostly I
wanted to make sure this all makes sense.

Also, is there any notion in Taste similar to Lucene's explain method?


After this sanity check, my next goal is to show how a new  
Lincolnphile coming into the system would be guided to other content  
on Lincoln.

[And yes, once done, this code will be publicly available, but it will  
be a little while]

Here's my snippet of code for recommending, pretty much verbatim from  
the Taste docs:
     UserSimilarity userSimilarity = new
     // Optional:

     UserNeighborhood neighborhood =
             new NearestNUserNeighborhood(neighSize, userSimilarity,
     // Print the neighborhood computed for this user
     Collection<User> users = neighborhood.getUserNeighborhood(userId);
     for (User neighbor : users) {
       System.out.println("Neighbor: " + neighbor);
     }

     Recommender recommender =
             new GenericUserBasedRecommender(dataModel, neighborhood,
     Recommender cachingRecommender = new

     // Ask for the top 10 recommendations and print each doc id and title
     List<RecommendedItem> recommendations =
             cachingRecommender.recommend(userId, 10);
     for (RecommendedItem item : recommendations) {
       Item theItem = item.getItem();
       String title = idsToTitle.get(theItem.getID().toString());
       System.out.println("Doc Id: " + theItem + " Title: " + title);
     }
