mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Taste-GenericItemBasedRecommender
Date Sun, 06 Dec 2009 01:39:10 GMT
On Sat, Dec 5, 2009 at 5:17 PM, Sean Owen <srowen@gmail.com> wrote:

> I suggest for purposes of the project we would build implementations
> of Recommender that can consume some output from Hadoop on HDFS, like
> a SequenceFile or whatever it's called. Shouldn't be hard at all. This
> sort of hybrid approach is already what happens with slope-one -- I
> wrote some jobs to build its diffs and then you can load the output
> into SlopeOneRecommender -- which works online from there.
>

Something generic like this would be helpful, I think, as well as outputting
to a Lucene index.

> I wonder how important these implementations are for the project,
> which seems like a bit of heresy -- surely Mahout needs to support
> recommendation on huge amounts of data? I think the answer's yes, but:
>
> LinkedIn and Netflix and Apple and most organizations with huge data
> to recommend from have already developed sophisticated, customized
> solutions.
>

Actually, from direct experience and conversations with principals involved,
I can tell you that you would be surprised at the lack of sophistication in
some parts of the production systems at all three of these places (as well
as, e.g., Amazon).

Mahout could end up becoming large parts of some of their infrastructure
for doing this at some point.

> Organizations with less than 100M data points or so to process don't
> need distributed architectures. They can use Mahout as-is with its
> online non-distributed recommenders pretty well. 10 lines of code and
> one big server and a day of tinkering and they have a full-on simple
> recommender engine, online or offline. And I argue that this is about
> 90% of users of the project who want recommendations.
>

Today, yes.  In a year, this number will maybe be 80%.  In 2 years - maybe
60%.  Big data is coming to smaller and smaller organizations.
Usage data doesn't need to be only internal: stuff you mine off of the web
can be used too...


> So who are these organizations that have enough data (like 1B+ data
> points) that they need something like the rocket science that LinkedIn
> needs, but can't or haven't developed such capability already
> in-house?
>
> I guess that's why I've been reluctant to engineer and complicate the
> framework to fit in offline distributed recommendation -- because this
> can become as complex as we like -- since I wonder at the 'market' for
> it. But it seems inevitable that this must exist, even if just as a
> nice clean simple reference implementation of the idea. Perhaps I
> won't go overboard on designing something complex yet here at the
> moment.
>

Everything I said above aside: I agree that over-engineering
something to do this is not desirable.  But thinking about how we
output partially processed "matrices" for on-line recommendation
generation is something we should still do.  Maybe we're adequately
served by spitting out SequenceFiles, with a simple api for zipping
through them and producing scores using pluggable scoring functions?
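
To make the "zip through rows, plug in a scoring function" idea concrete,
here is a rough sketch. Everything in it is hypothetical: RowScoringSketch,
ScoringFunction, WEIGHTED_SUM and topN are made-up names, not Mahout APIs,
and a plain in-memory Iterable of (itemID, sparse row) pairs stands in for
whatever a real SequenceFile reader on HDFS would hand back.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names are existing Mahout classes. */
final class RowScoringSketch {

  /** Pluggable scoring function: combines one precomputed row with a user's preferences. */
  interface ScoringFunction {
    double score(Map<Long, Double> row, Map<Long, Double> userPrefs);
  }

  /** A candidate item together with the score it received. */
  static final class ScoredItem {
    final long itemId;
    final double score;

    ScoredItem(long itemId, double score) {
      this.itemId = itemId;
      this.score = score;
    }
  }

  /** One possible scorer: sum of (row weight * user preference) over shared item IDs. */
  static final ScoringFunction WEIGHTED_SUM = (row, userPrefs) -> {
    double total = 0.0;
    for (Map.Entry<Long, Double> cell : row.entrySet()) {
      Double pref = userPrefs.get(cell.getKey());
      if (pref != null) {
        total += cell.getValue() * pref;
      }
    }
    return total;
  };

  /**
   * Makes a single pass over the rows (assumed here to be the offline job's output,
   * keyed by item ID) and returns the n best-scoring items the user has not rated.
   */
  static List<ScoredItem> topN(Iterable<Map.Entry<Long, Map<Long, Double>>> rows,
                               Map<Long, Double> userPrefs,
                               ScoringFunction fn,
                               int n) {
    List<ScoredItem> scored = new ArrayList<>();
    for (Map.Entry<Long, Map<Long, Double>> row : rows) {
      if (userPrefs.containsKey(row.getKey())) {
        continue; // skip items the user already has a preference for
      }
      scored.add(new ScoredItem(row.getKey(), fn.score(row.getValue(), userPrefs)));
    }
    scored.sort(Comparator.comparingDouble((ScoredItem s) -> s.score).reversed());
    return new ArrayList<>(scored.subList(0, Math.min(n, scored.size())));
  }

  public static void main(String[] args) {
    // Toy item-item "matrix": itemID -> (other itemID -> similarity).
    Map<Long, Map<Long, Double>> rows = new LinkedHashMap<>();
    rows.put(1L, Map.of(2L, 0.5, 3L, 0.1));
    rows.put(2L, Map.of(1L, 0.5, 3L, 0.2));
    rows.put(3L, Map.of(1L, 0.1, 2L, 0.2));

    Map<Long, Double> userPrefs = Map.of(1L, 5.0); // the user rated item 1 only

    for (ScoredItem item : topN(rows.entrySet(), userPrefs, WEIGHTED_SUM, 2)) {
      System.out.println(item.itemId + " -> " + item.score);
    }
  }
}
```

Swapping WEIGHTED_SUM for some other ScoringFunction changes the scoring
without touching the loop over the rows, which is the only part that would
need to know about the on-disk format.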

  -jake
