mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Taste-GenericItemBasedRecommender
Date Fri, 04 Dec 2009 19:35:10 GMT
On Fri, Dec 4, 2009 at 12:33 AM, Sean Owen <srowen@gmail.com> wrote:

> Yes, this makes sense. I do need two passes. One pass converts input
> from "user,item,rating" triples into user vectors. Then the second
> step builds the co-occurrence A'A product. I agree that it will be
> faster to take a shortcut than properly compute A'A.
>

You can work directly with the triples as input if the input is relatively
small.  The mapper emits user as key and item as value (this is
co-occurrence only, so the rating should only be used to filter which items
get emitted).  The reducer takes the list of items for a user and writes
out the cross product; it doesn't need to keep the full cross product, only
the item list.  A second MR then uses (item1, item2) as the key and
basically does a word count.

This approach has the problem of a large, uncombinable intermediate file.
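
For concreteness, the first layout looks roughly like this (untested sketch
against the newer Hadoop API; the class names and the rating threshold are
made up):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// MR 1 mapper: user -> item; the rating is only used as a filter.
public class TriplesToUserItemMapper
    extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split(",");   // user,item,rating
    if (Double.parseDouble(fields[2]) >= 3.0) {     // arbitrary threshold
      context.write(new Text(fields[0]), new Text(fields[1]));
    }
  }
}

// MR 1 reducer: stream out the cross product of the user's items; only the
// item list is held in memory, never the pairs themselves.
public class UserCrossProductReducer
    extends Reducer<Text, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  @Override
  protected void reduce(Text user, Iterable<Text> items, Context context)
      throws IOException, InterruptedException {
    List<String> itemList = new ArrayList<String>();
    for (Text item : items) {
      itemList.add(item.toString());
    }
    for (String a : itemList) {
      for (String b : itemList) {
        if (!a.equals(b)) {
          context.write(new Text(a + '\t' + b), ONE);
        }
      }
    }
  }
}

// MR 2 is then an ordinary word count keyed on the (item1, item2) pairs.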

The preferable approach is for the first MR step to group by user as
before, then, in the reducer, down-sample the user's items if desired and
output that list in a single record.  The down-sampling can be done on-line,
keeping only the retained elements in memory.  The second MR would then
produce the cross product in the mapper and use a combiner and a reducer.
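
Sketched out (same caveats: untested, made-up names; MAX_ITEMS is an
arbitrary cap, and MR 2 assumes MR 1 wrote Text key/value records):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// MR 1 reducer: keep at most MAX_ITEMS items per user via on-line reservoir
// sampling, then emit them as a single space-separated record.
public class DownSamplingUserVectorReducer
    extends Reducer<Text, Text, Text, Text> {
  private static final int MAX_ITEMS = 500;
  private final Random random = new Random();
  @Override
  protected void reduce(Text user, Iterable<Text> items, Context context)
      throws IOException, InterruptedException {
    List<String> reservoir = new ArrayList<String>(MAX_ITEMS);
    int seen = 0;
    for (Text item : items) {
      seen++;
      if (reservoir.size() < MAX_ITEMS) {
        reservoir.add(item.toString());
      } else {
        int slot = random.nextInt(seen);
        if (slot < MAX_ITEMS) {
          reservoir.set(slot, item.toString());
        }
      }
    }
    StringBuilder line = new StringBuilder();
    for (String item : reservoir) {
      if (line.length() > 0) {
        line.append(' ');
      }
      line.append(item);
    }
    context.write(user, new Text(line.toString()));
  }
}

// MR 2 mapper: cross product of the (already down-sampled) items; a combiner
// and reducer that sum the counts finish the cooccurrence matrix.
public class CooccurrencePairMapper
    extends Mapper<Text, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  @Override
  protected void map(Text user, Text itemLine, Context context)
      throws IOException, InterruptedException {
    String[] items = itemLine.toString().split(" ");
    for (String a : items) {
      for (String b : items) {
        if (!a.equals(b)) {
          context.write(new Text(a + '\t' + b), ONE);
        }
      }
    }
  }
}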

If we wanted to follow up on Jake's idea, the transpose can be done in the
first MR step but it is hard to down-sample the user's items that way.

The first step of converting triples to vectors should also produce user and
item counts so that the second step can do the sparsification cleanly.
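
One way to get those counts out of the same pass is a named side output;
the output names here are invented and would have to be registered with
MultipleOutputs.addNamedOutput in the job setup:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class UserVectorWithCountsReducer
    extends Reducer<Text, Text, Text, Text> {
  private MultipleOutputs<Text, Text> sideOutputs;

  @Override
  protected void setup(Context context) {
    sideOutputs = new MultipleOutputs<Text, Text>(context);
  }

  @Override
  protected void reduce(Text user, Iterable<Text> items, Context context)
      throws IOException, InterruptedException {
    StringBuilder line = new StringBuilder();
    int count = 0;
    for (Text item : items) {
      if (count > 0) {
        line.append(' ');
      }
      line.append(item.toString());
      count++;
      // one record per occurrence; a trivial word count over this side
      // output gives the per-item totals
      sideOutputs.write("itemCounts", item, new Text("1"));
    }
    context.write(user, new Text(line.toString()));           // the user vector
    sideOutputs.write("userCounts", user, new Text(Integer.toString(count)));
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    sideOutputs.close();
  }
}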


> (Though I'm curious how this works -- looks deceptively easy, this
> outer product approach. Isn't v cross v potentially huge? or likely to
> be sparse enough to not matter)
>

The cost is dominated by the square of the number of items that the most
prolific user has.  If you down-sample that to something reasonable, then
you bound the total cost very tightly.  It is very unusual for the massively
prolific users to add any information ... generally they are spammers or
bots anyway.
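
Back of the envelope, with n_u the number of items user u has touched and
k the down-sampling cap (the notation is mine):

  \sum_u n_u^2 \;\le\; \Bigl(\max_u n_u\Bigr) \sum_u n_u
  \qquad\text{and, after capping at } k,\qquad
  \sum_u \min(n_u, k)^2 \;\le\; k \sum_u n_u

so the number of pairs emitted is at most k times the total number of
preferences, i.e. linear in the size of the input.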


> I understand the final step in principle, which is to compute (A'A)h.
> But I keep guessing A'A is too big to fit in memory?


Correct.  (A'A) h can be computed in several ways, but it all comes down to
the fact that h is very sparse.  Typically you make it even sparser by
keeping only recent history.  If you have only 50 non-zeros in h, then you
only need 50 columns of (A'A).  These can be retrieved many different ways,
but one cool way is to make each row of A'A be a Lucene document.  The terms
in the documents are items and the columns of A'A are the posting vectors in
Lucene.  The weighting that Lucene does generally helps but can easily be
defeated if desired.
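
As a sketch (written against a recent Lucene API; the field names and the
space-separated encoding of a row are only for illustration):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;

public class CooccurrenceIndex {

  // One row of A'A becomes one document: the row's item id plus the ids of
  // the items it co-occurs with (repeating an id would carry the count).
  public static void indexRow(IndexWriter writer, String itemId,
                              String cooccurringItemIds) throws IOException {
    Document doc = new Document();
    doc.add(new StringField("item", itemId, Field.Store.YES));
    doc.add(new TextField("cooccurs", cooccurringItemIds, Field.Store.NO));
    writer.addDocument(doc);
  }

  // (A'A) h for a sparse h: query with the user's recent items and let
  // Lucene's scoring play the role of the weighted column sum.
  public static ScoreDoc[] recommend(IndexSearcher searcher,
                                     Iterable<String> recentItemIds,
                                     int howMany) throws IOException {
    BooleanQuery.Builder query = new BooleanQuery.Builder();
    for (String itemId : recentItemIds) {
      query.add(new TermQuery(new Term("cooccurs", itemId)),
                BooleanClause.Occur.SHOULD);
    }
    return searcher.search(query.build(), howMany).scoreDocs;
  }
}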

Another approach is to store each column of A'A in a key-value store.  At
recommendation time, you retrieve the columns for the items in h and add
them up.  This is essentially equivalent to the Lucene approach, without
Lucene.  Because we know a lot about the contents (they are integers), you
can probably write tighter code than Lucene uses.  This would be a great
use for the fancy concurrent map builder in Google Collections, for
instance.
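
For instance (untested; Google Collections' MapMaker for the map, int item
ids, and made-up class names):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentMap;

import com.google.common.collect.MapMaker;

public class InMemoryCooccurrenceStore {

  // item id -> sparse column of A'A (co-occurring item id -> count)
  private final ConcurrentMap<Integer, Map<Integer, Integer>> columns =
      new MapMaker().concurrencyLevel(16).makeMap();

  public void putColumn(int itemId, Map<Integer, Integer> column) {
    columns.put(itemId, column);
  }

  // (A'A) h for a sparse h: fetch the columns for h's non-zero entries and
  // accumulate the weighted sum.
  public Map<Integer, Double> multiply(Map<Integer, Double> h) {
    Map<Integer, Double> result = new HashMap<Integer, Double>();
    for (Map.Entry<Integer, Double> entry : h.entrySet()) {
      Map<Integer, Integer> column = columns.get(entry.getKey());
      if (column == null) {
        continue;
      }
      double weight = entry.getValue();
      for (Map.Entry<Integer, Integer> cell : column.entrySet()) {
        Double old = result.get(cell.getKey());
        double add = weight * cell.getValue();
        result.put(cell.getKey(), old == null ? add : old + add);
      }
    }
    return result;
  }
}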


-- 
Ted Dunning, CTO
DeepDyve
