mahout-user mailing list archives

From Pat Ferrel
Subject Asymmetric training and query in recommenders
Date Tue, 25 Feb 2014 17:19:42 GMT
Recommenders need user preference data. The more the better, right? Well, yes and no…

Assume you have a catalog where things are added often but older items also remain in
stock for some time. Training on user preference data over a fairly long time period will
likely be a good thing. But this user history of everything may not be the best query to
use for returning recs.

Using an offline precision metric (MAP@n) and real ecommerce data we built Mahout recommender
models on 3, 6, 9, and 12 months of data. We held out the most recent 10% for testing the
recommender’s predictions. As one would expect, the more data the better. But I think there
is a hidden problem in this.
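
For readers who want to reproduce this kind of offline test, here is a minimal sketch of MAP@n in Python. It uses one common definition (average precision normalized by min(n, number of held-out items)); the exact variant used in the experiment above is not stated, so treat the normalization, the function names, and the dict-based inputs as assumptions.

```python
def average_precision_at_n(recommended, held_out, n):
    """AP@n for one user: sum of precision@k at each hit rank k,
    divided by min(n, number of held-out items) (assumed normalization)."""
    relevant = set(held_out)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    # Assumes the ranked list contains no duplicate items.
    for k, item in enumerate(recommended[:n], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / min(n, len(relevant))


def map_at_n(recs_by_user, held_out_by_user, n):
    """Mean of AP@n over users that have at least one held-out item."""
    users = [u for u, items in held_out_by_user.items() if items]
    if not users:
        return 0.0
    total = sum(
        average_precision_at_n(recs_by_user.get(u, []), held_out_by_user[u], n)
        for u in users
    )
    return total / len(users)
```

Here `recs_by_user` maps a user id to a ranked list of recommended items, and `held_out_by_user` maps a user id to the items in that user's held-out (most recent) actions.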

Using a user’s entire history may not lead to the best recs for today. The intuition is
that the most recent n actions should be used for making recs, thereby capturing the user’s
current intent.

Unfortunately Mahout’s recommenders use the same data to build the “indicator matrix”
as they do to make the query for returning recs.

Current Mahout:
B = history of all preferences by all users
Mahout calculates recs by doing
[B’B]B’ = R, where [B’B] is actually the output of the RowSimilarityJob (RSJ) and so is an
“indicator matrix”, not just a cooccurrence matrix. I always use log-likelihood ratio (LLR)
as the similarity in the RSJ, so [B’B] is to be seen as shorthand for this.
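
To make the shapes concrete, here is a toy sketch of [B’B]B’ = R in plain Python. It is not Mahout code: raw cooccurrence counts (diagonal included) stand in for the LLR-filtered RowSimilarityJob output, and the 3x3 matrix B is made-up data.

```python
def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two list-of-lists matrices."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


# Rows are users, columns are items; 1 means the user expressed a preference.
# Hypothetical data for illustration only.
B = [
    [1, 1, 0],  # user 0
    [0, 1, 1],  # user 1
    [1, 1, 1],  # user 2
]

Bt = transpose(B)

# items x items. Plain cooccurrence here; the real indicator matrix would be
# the LLR-based similarity output, which this sketch does not compute.
indicator = matmul(Bt, B)

# items x users. Column u holds the item scores for user u, computed from
# that user's entire history (column u of B').
R = matmul(indicator, Bt)
```

Note that the same B feeds both sides of the product: it trains the indicator matrix and it supplies every user's query.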

The problem with this approach is that B is the only input and is therefore used for the
query as well as the training.

Using the Solr+Mahout recommender--where the query is in realtime and the training occurs
periodically in the background--solves this problem nicely. The indicator matrix is produced
on as much data as possible but there is no requirement that all of that be used in the query.
For the Solr+Mahout recommender I’d rather say:
[B’B]h = R, where h is a user's history going back as far as you think is useful and B is as
much data as makes sense for your catalog. Picking h is probably best done by taking the most
recent n actions/prefs rather than using a point-in-time cutoff, because some people are more
active than others.
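
A rough sketch of the asymmetric query in plain Python follows. The indicator matrix is hard-coded, hypothetical data standing in for whatever the background training produced, and the (timestamp, item_index) action format and both function names are assumptions made for illustration. This is not the Solr query itself, only the arithmetic behind [B’B]h.

```python
def recent_history_vector(actions, n, num_items):
    """Build h from the n most recent (timestamp, item_index) actions."""
    latest = sorted(actions, key=lambda a: a[0], reverse=True)[:n]
    h = [0] * num_items
    for _, item in latest:
        h[item] = 1
    return h


def score(indicator, h):
    """Compute [B'B]h: one score per item for a single user."""
    return [sum(w * x for w, x in zip(row, h)) for row in indicator]


# Hypothetical items x items indicator matrix built offline from long-term data.
indicator = [
    [0, 2, 1, 0],
    [2, 0, 2, 0],
    [1, 2, 0, 3],
    [0, 0, 3, 0],
]

# One user's full action log as (timestamp, item_index); timestamps are made up.
actions = [(1, 0), (5, 1), (9, 3), (12, 2)]

# Query with only the 2 most recent actions instead of the whole log.
h = recent_history_vector(actions, n=2, num_items=4)
r = score(indicator, h)
```

Taking the last n actions gives every user a query of the same size regardless of how active they are, which a point-in-time cutoff would not.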

I think this indicates an improvement that could be made to Mahout’s recommender. Either
B and h can be supplied separately or we can leave the query to Solr.

Anyone have an opinion?