mahout-user mailing list archives

From Jake Mannix <>
Subject Re: Overhauled
Date Fri, 23 Apr 2010 14:25:32 GMT
I really should get my partially finished version of this in there... it
seems you guys keep converging closer and closer to my weird
matrix-triple-product way of doing it as time goes on.  :)

But yes, in general: avoiding MapFiles always helps.  Hadoop
is designed for bulk sequential access, and letting it do that
allows for maximal throughput; doing anything else is... fraught
with peril.


On Fri, Apr 23, 2010 at 2:44 AM, Sean Owen <> wrote:

> I thought it might be worth bringing this back to the user list.
> Ankur effectively raised issues about the performance of
> by adding
>, which is a similar
> recommender job (item cooccurrence-based) but with a different
> implementation. ".item" ultimately does not distribute the matrix-user
> vector multiply, and ".cooccurrence" highly distributes it.
> .item accomplished this by side-loading the co-occurrence matrix into
> a reducer, by accessing it from disk as MapFiles. This way of
> accessing columns proved to be very slow.
> After much experimentation, I've completely overhauled .item by
> grafting in ideas from .cooccurrence. It is a sort of
> best-of-both-worlds hybrid of the two. It borrows a clever way to join
> two kinds of input into one MapReduce, in order to join the
> co-occurrence matrix columns and individual elements of each user
> vector. The product is output and recombined later. This hybrid
> retains features of .item like accommodating user ratings.
> Letting Hadoop manage the data flow, even though it takes a bit more
> copying, avoiding reading from MapFile in a random-access manner,
> using features like the Combiner, and being smarter about Writables
> has sped this up for me by at least a factor of 10 -- mostly from
> avoiding MapFiles.
> I bring it up since it's interesting, a good development for anyone
> using this implementation, and an area that is ripe for more testing
> and improvement I imagine.
> Sean
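
For readers following the thread, the join-then-recombine idea described above can be sketched roughly as follows. This is an in-memory Python illustration only, not the Mahout job itself: the function names (`join_by_item`, `partial_products`, `recombine`) and the dict-based data layout are assumptions made for the example, and the real implementation runs as Hadoop MapReduce passes over Writables. It shows each co-occurrence column being grouped with the user-vector elements for the same item, each pair producing a partial product, and the partial products being summed per user afterwards.

```python
from collections import defaultdict


def join_by_item(cooccurrence_columns, user_prefs):
    """Stand-in for the shuffle: group one matrix column and all
    user-vector elements that share the same item key."""
    grouped = defaultdict(lambda: {"column": {}, "prefs": []})
    for item, column in cooccurrence_columns.items():
        grouped[item]["column"] = column
    for user, prefs in user_prefs.items():
        for item, value in prefs.items():
            grouped[item]["prefs"].append((user, value))
    return grouped


def partial_products(grouped):
    """For each (user, preference) joined to an item's column, emit the
    column scaled by that preference, keyed by user."""
    for record in grouped.values():
        for user, value in record["prefs"]:
            scaled = {other: value * count
                      for other, count in record["column"].items()}
            yield user, scaled


def recombine(partials):
    """Sum the partial vectors per user (the role a combiner/reducer
    would play in a MapReduce version)."""
    totals = defaultdict(lambda: defaultdict(float))
    for user, vector in partials:
        for item, score in vector.items():
            totals[user][item] += score
    return {user: dict(vector) for user, vector in totals.items()}


def recommend_scores(cooccurrence_columns, user_prefs):
    """Matrix-times-user-vector, computed column by column."""
    grouped = join_by_item(cooccurrence_columns, user_prefs)
    return recombine(partial_products(grouped))
```

Calling `recommend_scores(columns, prefs)` with a dict of item columns and a dict of per-user preferences returns per-user score vectors. Every step reads its input in one sequential pass, which is the property the thread credits for the speedup; nothing looks up a column by key from disk.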
