From Sean Owen
Subject Re: Taste-GenericItemBasedRecommender
Date Sun, 06 Dec 2009 01:17:27 GMT
I suggest for purposes of the project we would build implementations
of Recommender that can consume some output from Hadoop on HDFS, like
a SequenceFile or whatever it's called. Shouldn't be hard at all. This
sort of hybrid approach is already what happens with slope-one -- I
wrote some jobs to build its diffs and then you can load the output
into SlopeOneRecommender -- which works online from there.
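
To make the hybrid idea concrete, here is a minimal sketch of the online
half: a class that loads precomputed item-item average diffs and estimates
a preference from them. This is an illustration only, not Mahout's API.
The class name PrecomputedSlopeOne, its methods, and the tab-separated
"itemA, itemB, averageDiff" line format are all assumptions made for the
example; a real job would more likely write a SequenceFile, and the real
SlopeOneRecommender also supports weighting, which this omits.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of an online recommender fed by offline output. */
final class PrecomputedSlopeOne {

    // averageDiff.get(a).get(b) = mean of (rating of a - rating of b),
    // as computed ahead of time by some offline job.
    private final Map<Long, Map<Long, Double>> averageDiff = new HashMap<>();

    /** Loads lines of the assumed form "itemA<TAB>itemB<TAB>averageDiff". */
    static PrecomputedSlopeOne fromLines(Iterable<String> lines) {
        PrecomputedSlopeOne model = new PrecomputedSlopeOne();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // tolerate blank lines in the job output
            }
            String[] fields = line.split("\t");
            long itemA = Long.parseLong(fields[0]);
            long itemB = Long.parseLong(fields[1]);
            double diff = Double.parseDouble(fields[2]);
            model.averageDiff
                 .computeIfAbsent(itemA, k -> new HashMap<>())
                 .put(itemB, diff);
        }
        return model;
    }

    /**
     * Unweighted slope-one estimate: average of (user's rating of j + diff(target, j))
     * over the rated items j for which a diff is known. Returns NaN if none apply.
     */
    double estimate(Map<Long, Double> userRatings, long targetItem) {
        Map<Long, Double> diffs = averageDiff.get(targetItem);
        if (diffs == null) {
            return Double.NaN;
        }
        double sum = 0.0;
        int count = 0;
        for (Map.Entry<Long, Double> rated : userRatings.entrySet()) {
            Double diff = diffs.get(rated.getKey());
            if (diff != null) {
                sum += rated.getValue() + diff;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }
}
```

The point of the split is that only fromLines() touches the offline
output; everything after that answers requests from memory, so the
recommender itself stays an ordinary online one.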

At least then the "hybrid" offline/online recommenders aren't yet a
third species of recommender in the framework. Perhaps there isn't
even a need for fully offline recommenders? Just jobs that can produce
supporting intermediate output for online recommenders? That'd be
tidier still.

If I may digress --

I wonder how important these implementations are for the project,
which seems like a bit of heresy -- surely Mahout needs to support
recommendation on huge amounts of data? I think the answer's yes, but:

LinkedIn and Netflix and Apple and most organizations with huge data
to recommend from have already developed sophisticated, customized

Organizations with fewer than 100M data points or so to process don't
need distributed architectures. They can use Mahout as-is with its
online non-distributed recommenders pretty well. 10 lines of code and
one big server and a day of tinkering and they have a full-on simple
recommender engine, online or offline. And I argue that this is about
90% of users of the project who want recommendations.

So who are these organizations that have enough data (like 1B+ data
points) that they need something like the rocket science that LinkedIn
needs, but can't or haven't developed such capability already

I guess that's why I've been reluctant to engineer and complicate the
framework to fit in offline distributed recommendation: it can become
as complex as we like, and I wonder at the 'market' for it. But it
seems inevitable that this must exist, even if just as a
nice clean simple reference implementation of the idea. Perhaps I
won't go overboard on designing something complex yet here at the

On Sun, Dec 6, 2009 at 12:43 AM, Jake Mannix wrote:
> But having a nice api for *outputting* the precomputed matrices which
> are pretty big into a format where online "queries"/recommendation
> requests can be computed I think is really key here.   We should think
> much more about what makes the most sense here.

