mahout-user mailing list archives

From Sean Owen <>
Subject Re: Taste-GenericItemBasedRecommender
Date Sat, 05 Dec 2009 09:42:10 GMT
On Fri, Dec 4, 2009 at 7:35 PM, Ted Dunning <> wrote:
> The preferable approach is for the first MR step to group by user as before,
> then in the reduce down-sample the user items if desired and output that
> list in a single record.  Down-sampling can be done on-line keeping just the
> retained elements in memory.  Second MR would produce the cross product in
> the mapper and use a combiner and reducer.

That's what I'm doing -- outputting a Vector per user in the first MR.
(I'm leaving out the extras like downsampling until the basic approach works.)
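
As a rough illustration of that first step with the optional down-sampling
included, here is a minimal sketch in plain Python rather than Hadoop code.
The function name, the max_items parameter, and the choice of reservoir
sampling as the on-line down-sampling method are all assumptions made for
the example, not something the thread specifies.

```python
import random
from collections import defaultdict

def group_by_user(prefs, max_items=None, seed=0):
    """Group (user_id, item_id) pairs into one item list per user.

    If max_items is set, each user's list is down-sampled on-line with
    reservoir sampling, so only the retained items are kept in memory.
    (Hypothetical helper; stands in for the first MR's reduce step.)
    """
    rng = random.Random(seed)
    kept = defaultdict(list)  # user_id -> retained item IDs
    seen = defaultdict(int)   # user_id -> number of items seen so far
    for user_id, item_id in prefs:
        seen[user_id] += 1
        items = kept[user_id]
        if max_items is None or len(items) < max_items:
            items.append(item_id)
        else:
            # Keep the new item with probability max_items / seen,
            # overwriting a uniformly chosen retained item.
            j = rng.randrange(seen[user_id])
            if j < max_items:
                items[j] = item_id
    return dict(kept)
```

Calling group_by_user(prefs) with no limit returns every item per user;
passing max_items=50, say, caps each user's list at 50 items.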

I think I'm going a different way to produce the cooccurrence matrix:
no cross product, just counting every cooccurrence and outputting
item1ID -> item2ID as key-value pairs. That makes it tidy to produce
the rows of the cooccurrence matrix in the reducer.
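
One possible reading of that, sketched in plain Python rather than as
Hadoop mappers and reducers: the "map" part emits an (item1ID, item2ID)
pair for every two distinct items in a user's vector, and the "reduce"
part counts the pairs that share item1ID to form one row. The function
name is hypothetical, and leaving out the diagonal (an item with itself)
is an assumption of this sketch.

```python
from collections import Counter, defaultdict

def cooccurrence_rows(user_items):
    """Build rows of the item-item cooccurrence matrix.

    user_items maps user_id -> iterable of item IDs (one vector per user).
    Returns item1ID -> {item2ID: cooccurrence count}.
    """
    # "Map": emit (item1ID, item2ID) for each ordered pair of distinct
    # items that appear together in one user's vector.
    pairs = []
    for items in user_items.values():
        unique = sorted(set(items))
        for a in unique:
            for b in unique:
                if a != b:
                    pairs.append((a, b))

    # "Reduce": all pairs with the same item1ID arrive together, so
    # counting the item2IDs yields one row of the matrix.
    rows = defaultdict(Counter)
    for a, b in pairs:
        rows[a][b] += 1
    return {item: dict(counts) for item, counts in rows.items()}
```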

> Correct.  (A'A) h can be computed in several ways, but it all comes down to
> the fact that h is very sparse.  Typically you make it even sparser by
> keeping only recent history.  If you have only 50 non-zeros in h, then you
> only need 50 columns of (A'A).  These can be retrieved many different ways,
> but one cool way is to make each row of A'A be a Lucene document.  The terms
> in the documents are items and the columns of A'A are the posting vectors in
> Lucene.  The weighting that Lucene does generally helps but can easily be
> defeated if desired.

I'll hold off on leveraging Lucene until later. I'll also probably start
by just loading the whole row, though that's not very efficient. The
other optimizations you mention later also make sense.

> Another approach is to make each column of A'A be stored in a key-value
> store.  At recommendation time, you retrieve columns and add them.  This is
> essentially equivalent to the Lucene approach without lucene.  Because we
> know a lot about the contents (they are integers), you can probably write
> tighter code than Lucene can use.  This would be a great use for the fancy
> concurrent map builder that is in Google collections, for instance.

Sounds cool, but don't I need the rows of A'A to multiply against h? h
is a column vector.
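
For what it's worth, A'A is symmetric, so row i and column i hold the
same values, and (A'A) h can be formed either as one dot product per row
or as a sum of the columns picked out by the non-zeros of h. A minimal
sketch of the column-sum form in plain Python follows; the dict-of-dicts
layout and the function name are assumptions for illustration, standing
in for whatever key-value store would actually be used.

```python
def recommend_scores(cooc, history):
    """Compute (A'A) h for a sparse history vector h.

    cooc maps item -> {other_item: count} (a column of A'A, which equals
    the matching row because A'A is symmetric).
    history maps item -> preference value, non-zeros of h only.
    """
    scores = {}
    for item, weight in history.items():
        # Fetch only the columns for items the user has in h, and add
        # each one in, scaled by the user's value for that item.
        for other, count in cooc.get(item, {}).items():
            scores[other] = scores.get(other, 0.0) + weight * count
    return scores
```

With 50 non-zeros in h, the loop touches only 50 columns, which matches
the point in the quoted text above.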

Also, why did you later say recommendation must occur online? It seems
quite doable offline, and my picture of the point of this whole Hadoop
framework is doing things offline. They've already gone to the trouble
of running a cluster and have given up doing it entirely online, so...
