mahout-user mailing list archives

From Pat Ferrel <>
Subject Re: Solr-recommender
Date Thu, 10 Oct 2013 15:59:42 GMT
The issue of offline tests is often misunderstood, I suspect. While I agree with Ted, it might do to explain a bit.

For myself, I'd say offline testing is a requirement, but not for comparing two disparate recommenders. Companies like Amazon and Netflix, as well as others on record, have a workflow that includes offline testing and comparison against previous versions of their own code on their own gold data set. These comparisons can be quite useful, if only in pointing to otherwise obscure bugs: if they see a difference between two offline tests, they ask why. Then, when they think they have an optimal solution, they run A/B tests as challenger/champion competitions, and it is these that are the only reliable measure of goodness.
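As an aside, the challenger/champion comparison described above usually comes down to comparing conversion rates between two buckets. A minimal sketch in Python, with made-up counts and a hypothetical helper name, just to show the shape of the test:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for the difference between two conversion rates
    (champion vs. challenger), using a pooled proportion."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Made-up counts: champion converts 200/10000, challenger 260/10000.
z = two_proportion_z(200, 10000, 260, 10000)
print(round(z, 2))  # positive z favors the challenger; |z| > 1.96 is roughly p < 0.05
```

The point isn't the particular statistic; it's that the A/B comparison is grounded in live user behavior rather than a held-out log, which is why it settles arguments that offline tests can't.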

I do agree that comparing two recommenders with offline tests is dubious at best, as the paper points out. But put yourself in the place of a company new to recommenders that has several to choose from, maybe even versions of the same recommender with different tuning parameters. Run the offline tests with a standard set of your own data and pick the best to start with. What other choice do you have? Maybe flexibility or architecture trumps the offline tests; if not, using them is better than a random choice. Take this result with a grain of salt, though, and be ready to A/B test later challengers when or if you have time.
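To make "run the offline tests with a standard set of your own data" concrete, here is a minimal sketch, in Python rather than Mahout, of comparing two engines by precision@k on held-out interactions. The recommenders and data are entirely hypothetical stand-ins:

```python
# Minimal precision@k comparison of two recommenders on held-out data.
# recommend_a / recommend_b are hypothetical stand-ins: each maps a
# user id to a ranked list of item ids.

def precision_at_k(recommend, held_out, k=10):
    """Average fraction of the top-k recommendations that appear in
    each user's held-out interactions."""
    scores = []
    for user, true_items in held_out.items():
        top_k = recommend(user)[:k]
        hits = len(set(top_k) & set(true_items))
        scores.append(hits / k)
    return sum(scores) / len(scores)

# Toy held-out data: user -> items they actually interacted with later.
held_out = {
    "u1": ["i1", "i3"],
    "u2": ["i2"],
}

# Two stand-in recommenders returning fixed ranked lists.
recommend_a = {"u1": ["i1", "i2", "i3"], "u2": ["i2", "i1", "i3"]}.get
recommend_b = {"u1": ["i2", "i4", "i5"], "u2": ["i5", "i4", "i2"]}.get

print(precision_at_k(recommend_a, held_out, k=3))
print(precision_at_k(recommend_b, held_out, k=3))
```

Run against the same held-out set, the two scores give you a defensible starting choice, with all the caveats above about reading too much into the absolute numbers.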

In the case of the Solr recommender, it is extremely flexible and online (real-time results). For me, these features trump any offline tests against alternatives. But the demo site will include offline Mahout recommendations for comparison and, in the unlikely event that it gets any traffic, will incorporate A/B tests.

On Oct 9, 2013, at 4:29 PM, Ted Dunning <> wrote:

> On Wed, Oct 9, 2013 at 12:54 PM, Michael Sokolov <> wrote:
>
>> BTW, lest we forget, this does not imply the Solr-recommender is better
>> than Myrrix or the Mahout-only recommenders. There needs to be some
>> careful comparison of results. Michael, did you do offline or A/B tests
>> during your implementation?
>
> I ran some offline tests using our historical data, but I don't have a lot
> of faith in these beyond the fact that they indicate we didn't make any
> obvious implementation errors. We haven't attempted A/B testing yet, since
> our site is so new, and we need to get a meaningful baseline going and
> sort out a lot of other, more pressing issues on the site. Recommendations
> are only one piece, albeit an important one.
>
> Actually, there was an interesting idea for an article posted recently
> about the difficulty of comparing results across systems in this field:
> but that's no excuse not to do better. I'll certainly share when I know
> more :)

I tend to be a pessimist with regard to offline evaluation. It is fine to do, but if a system is anywhere near best, I think it should be considered for A/B testing.
