mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Overhauled org.apache.mahout.cf.taste.hadoop.item
Date Fri, 23 Apr 2010 14:25:32 GMT
I really should get my partially finished version of this in there... it
seems you guys keep converging closer and closer to my weird
matrix-triple-product way of doing it as time goes on.  :)

But yes, in general: avoiding MapFiles always helps.  Hadoop
is designed for bulk sequential access, and letting it do that
allows for maximal throughput; doing anything else is... fraught
with peril.

  -jake

On Fri, Apr 23, 2010 at 2:44 AM, Sean Owen <srowen@gmail.com> wrote:

> I thought it might be worth bringing this back to the user list.
>
> Ankur effectively raised issues about the performance of
> org.apache.mahout.cf.taste.hadoop.item by adding
> org.apache.mahout.cf.taste.hadoop.cooccurrence, which is a similar
> recommender job (item cooccurrence-based) but with a different
> implementation. ".item" ultimately does not distribute the matrix-user
> vector multiply, and ".cooccurrence" highly distributes it.
>
> .item accomplished this by side-loading the co-occurrence matrix into
> a reducer, by accessing it from disk as MapFiles. This way of
> accessing columns proved to be very slow.
>
> After much experimentation, I've completely overhauled .item by
> grafting in ideas from .cooccurrence. It is a sort of
> best-of-both-worlds hybrid of the two. It borrows a clever way to join
> two kinds of input into one MapReduce, in order to join the
> co-occurrence matrix columns and individual elements of each user
> vector. The product is output and recombined later. This hybrid
> retains features of .item like accommodating user ratings.
>
> Letting Hadoop manage the data flow, even though it takes a bit more
> copying, avoiding reading from MapFile in a random-access manner,
> using features like the Combiner, and being smarter about Writables
> has sped this up for me by at least a factor of 10, mostly from
> avoiding MapFiles.
>
> I bring it up since it's interesting, a good development for anyone
> using this implementation, and an area that is ripe for more testing
> and improvement I imagine.
>
> Sean
>
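
The join Sean describes can be sketched in memory to show how the data flows. This is a rough illustration in plain Python, not Mahout's actual Hadoop code: every function name here is hypothetical, the "shuffle" is just a dictionary grouping, and the final summation stands in for what a Combiner and Reducer would do on a real cluster. The idea it shows is that each co-occurrence column and each user preference element are keyed by the same item ID, so one reduce call sees both, emits a partial product, and the partial products are summed per user afterwards. Nothing is looked up by random access.

```python
from collections import defaultdict


def map_cooccurrence(cooccurrence_columns):
    # Mapper 1 (hypothetical): emit each matrix column keyed by its item ID.
    for item_id, column in cooccurrence_columns.items():
        yield item_id, ("column", column)


def map_user_prefs(user_vectors):
    # Mapper 2 (hypothetical): split each user vector into single elements,
    # keyed by item ID so they meet the matching column in the reducer.
    for user_id, prefs in user_vectors.items():
        for item_id, pref in prefs.items():
            yield item_id, ("pref", (user_id, pref))


def shuffle(*streams):
    # Stand-in for Hadoop's sort/shuffle: group all values by key.
    groups = defaultdict(list)
    for stream in streams:
        for key, value in stream:
            groups[key].append(value)
    return groups


def reduce_partial_products(groups):
    # Reducer of the join: for one item ID, multiply its column by each
    # user's preference for that item and emit the partial product.
    for item_id, values in groups.items():
        column = None
        prefs = []
        for tag, payload in values:
            if tag == "column":
                column = payload
            else:
                prefs.append(payload)
        if column is None:
            continue  # preference for an item with no co-occurrence data
        for user_id, pref in prefs:
            yield user_id, {other: pref * count for other, count in column.items()}


def sum_partials(partials):
    # Later pass: add up the partial products per user. Because addition is
    # associative, a Combiner could apply this same step map-side.
    totals = defaultdict(lambda: defaultdict(float))
    for user_id, partial in partials:
        for item_id, score in partial.items():
            totals[user_id][item_id] += score
    return {user_id: dict(scores) for user_id, scores in totals.items()}


def recommend_scores(cooccurrence_columns, user_vectors):
    groups = shuffle(map_cooccurrence(cooccurrence_columns),
                     map_user_prefs(user_vectors))
    return sum_partials(reduce_partial_products(groups))
```

Calling `recommend_scores` with a dict of columns (item ID to a dict of co-occurrence counts) and a dict of user vectors (user ID to a dict of item preferences) returns, per user, the same scores as multiplying the co-occurrence matrix by that user's vector.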
