mahout-user mailing list archives

From Sean Owen <>
Subject Re: Taste-GenericItemBasedRecommender
Date Sun, 06 Dec 2009 00:27:45 GMT
On Sat, Dec 5, 2009 at 10:27 PM, Ted Dunning <> wrote:
> Post your code and I will add the down-sampling.

Yep, I understand the point. The code is already in SVN; it's still in progress and
doesn't work at the moment, but you should see all the major parts.

> Also, I typically discard counts from the user vector because that improves
> resistance to spammers.  If you do that, then the counts you are producing
> will all be 1 (cooccurrence for a single user).  That means that the
> combiner and reducer will still be required.

Yes, I don't even output the 1, just the item -> item mapping. It's
the simplest place to start.
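To make the mapper side concrete, here is a toy Python sketch of that step (names are illustrative, not Mahout's actual classes): for one user's history, emit every item -> item pair and drop the count entirely, so each pair implicitly carries weight 1 and a heavy or spammy user can't dominate.

```python
from itertools import permutations

def map_user_history(items):
    """Mapper sketch: for one user's item history, emit every ordered
    (item, cooccurring_item) pair. No count is attached; each pair is
    implicitly weight 1 for this user, which is the spam-resistant form."""
    for a, b in permutations(items, 2):
        yield a, b

# One user who interacted with three items yields 3 * 2 = 6 pairs.
pairs = list(map_user_history(["x", "y", "z"]))
```

The combiner and reducer then do the actual summing across users, which is why they are still required even though every mapper output is effectively a 1.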

> I may be misunderstanding what you are saying.  You may be saying that you
> will be accumulating all the cooccurrence data for all users in memory.
> That seems like a bad idea given that the combiner/reducer approach is
> pretty darned fast.

No, it's never all in memory, but all cooccurrence for one item would be
in memory. The reducer would get the item -> item mappings -- really,
an item -> iterator-over-items -- and from that you construct a
row of the cooccurrence matrix. I think it's leveraging the combiner
just fine, since this is what collects all the cooccurrence for one
item into one place.
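The reducer side of that, again as a toy Python sketch rather than the real Hadoop code: the framework hands the reducer one item together with an iterator over everything that co-occurred with it (across all users), and counting that stream yields one sparse row of the cooccurrence matrix. Only that single row is ever materialized in memory.

```python
from collections import Counter

def reduce_item(item, cooccurring_items):
    """Reducer sketch: given one item and the stream of all items that
    co-occurred with it (the values grouped under this key), tally them
    into one sparse row of the cooccurrence matrix."""
    return item, Counter(cooccurring_items)

item, row = reduce_item("x", ["y", "z", "y"])
# row == Counter({"y": 2, "z": 1})
```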

> The first approach requires access to all rows of M, but does the dot
> products using the sparse pattern of v.  The second approach inverts the
> nesting of the loops and moves the use of sparsity of v to the outer loop.
> This allows us to access only a small number of columns of M which is a huge
> win even if A is entirely in memory.  Secondarily, it also effectively lifts
> the highly repeated determination of the sparsity pattern of v out of the
> inner loop resulting in additional savings.

Yeah, I get it; that makes a good bit of sense, and it is a big win.
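The two loop orderings Ted describes can be sketched like this (toy Python, with the cooccurrence matrix M = A'A held as a dict of sparse rows and the user history as the sparse vector v; since M is symmetric, a row doubles as a column). The second version puts the sparsity of v in the outer loop, so it only ever touches the handful of columns of M that the user's history names.

```python
def recommend_row_oriented(cooc, history):
    """First approach: visit every row of M and dot it against the
    sparse history vector v."""
    return {item: sum(row.get(h, 0) * w for h, w in history.items())
            for item, row in cooc.items()}

def recommend_column_oriented(cooc, history):
    """Second approach: invert the nesting. Outer loop over the sparse
    history, so only the few columns of M it names are ever accessed."""
    scores = {}
    for h, w in history.items():
        for item, count in cooc.get(h, {}).items():
            scores[item] = scores.get(item, 0) + count * w
    return scores

# A tiny symmetric cooccurrence matrix and a two-item history.
cooc = {
    "x": {"y": 2, "z": 1},
    "y": {"x": 2, "z": 1},
    "z": {"x": 1, "y": 1},
}
history = {"x": 1, "z": 1}
row_result = recommend_row_oriented(cooc, history)
col_result = recommend_column_oriented(cooc, history)
```

Both orderings compute the same product; the column-oriented one is also the natural shape for the real-time case below, since a fresh history only needs its own few columns, which are likely cached.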

> I say this because user histories change in real-time and it is common to
> require real-time updates to recommendations.  Since the recommendation is
> so very cheap to compute it seems reasonable to do it at real time rather
> than pay the cost of storing all recommendations for all users until they
> need them.  Moreover, since the recommendation is more dense than either the
> history or the columns of (A'A) and because many common columns of A'A will
> be cached in memory, the I/O cost isn't even all that different.  That means
> that total cost of just retrieving a recommendation is comparable to the
> cost of computing it on the fly.

Yes, I get this too, and it makes sense as well.

The biggest pain for me here is how to rationalize all of this into an
API. The current code is completely online. Now I'm dropping in a
truly offline/distributed version, which is a totally different
ballgame. And then there are all these hybrid approaches, computing
some stuff offline and some online and requiring real-time integration
with HDFS. I'm trying to incorporate it all in a way that isn't chaos,
since I'm trying to write a chapter on this now as well. Will have to
come up with a theory and a plan to implement it.
