On Sat, Dec 5, 2009 at 1:42 AM, Sean Owen <srowen@gmail.com> wrote:
> On Fri, Dec 4, 2009 at 7:35 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > The preferable approach is for the first MR step to group by user as
> > before, then in the reduce downsample the user items if desired and
> > output that list in a single record. Downsampling can be done online,
> > keeping just the retained elements in memory. The second MR would
> > produce the cross product in the mapper and use a combiner and reducer.
>
> That's what I'm doing: outputting a Vector per user in the first MR.
> (I'm leaving out the extras like downsampling until the basic approach
> works.)
>
If any users have looked at more than 1,000 items, downsampling becomes
very, very important. The problem is that the number of nonzero summands
for a user is proportional to the square of the number of items for that
user. The most active users quickly dominate the total cost.
If your user population follows Zipf's law exactly, the cost of the A'A
computation grows asymptotically quadratically in the number of users.
In practice, the situation is slightly better because the most active 1% of
all users are not as active as Zipf would predict, but the computation cost
is still substantially superlinear and thus is not scalable even at
relatively moderate numbers of users.
Post your code and I will add the downsampling.
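The online downsampling described above can be sketched as reservoir
sampling. This is an illustrative sketch, not Mahout code; the names
(downsample, max_items) are mine:

```python
import random

def downsample(items, max_items, rng=None):
    """Keep a uniform random sample of at most max_items from a stream,
    holding only the retained elements in memory (reservoir sampling)."""
    rng = rng or random.Random()
    kept = []
    for n, item in enumerate(items):
        if n < max_items:
            kept.append(item)
        else:
            # Admit the n-th item with probability max_items / (n + 1),
            # evicting a random retained item, so the sample stays uniform
            # over everything seen so far.
            j = rng.randint(0, n)
            if j < max_items:
                kept[j] = item
    return kept
```

Capping each user at, say, 1,000 items bounds that user's pair count at
roughly 500,000 instead of letting it grow with the square of the history
length.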
>
> I think I'm going a different way to produce the cooccurrence matrix:
> no cross product, just counting and outputting all cooccurrences, and
> outputting item1ID -> item2ID as key-value pairs. That makes it tidy
> to produce the rows of the cooccurrence matrix in the reducer.
>
This is essentially what the combiner would be doing. Your approach may be
a bit faster since the data moves less. The merge sorting used by Hadoop
may improve locality, which could offset your advantage, depending on how
large the cross product is.
Also, I typically discard counts from the user vector because that improves
resistance to spammers. If you do that, then the counts you are producing
will all be 1 (cooccurrence for a single user). That means that the
combiner and reducer will still be required.
I may be misunderstanding what you are saying. You may be saying that you
will be accumulating all the cooccurrence data for all users in memory.
That seems like a bad idea given that the combiner/reducer approach is
pretty darned fast.
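A minimal sketch of the combiner/reducer counting described above, with
per-user counts discarded so every map output carries a 1 (the function
names are illustrative, not Mahout's API):

```python
from collections import Counter
from itertools import combinations

def map_user(items):
    """Map over one user's (downsampled) item list: discard the per-user
    counts and emit a 1 for each cooccurring item pair."""
    for pair in combinations(sorted(set(items)), 2):
        yield pair, 1

def reduce_counts(keyed_ones):
    """Combiner/reducer: sum the 1s into cooccurrence counts. A combiner
    runs this same summation on map-local output before the shuffle."""
    totals = Counter()
    for pair, count in keyed_ones:
        totals[pair] += count
    return totals
```

Because every mapper emits a 1 for a pair, the combiner and reducer do real
work: they collapse the many (pair, 1) records into one count per pair.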
> > Another approach is to make each column of A'A be stored in a key-value
> > store. At recommendation time, you retrieve columns and add them. This
> > is essentially equivalent to the Lucene approach without Lucene. Because
> > we know a lot about the contents (they are integers), you can probably
> > write tighter code than Lucene can use. This would be a great use for
> > the fancy concurrent map builder that is in Google Collections, for
> > instance.
>
> Sounds cool but don't I need the rows of A'A to multiply against h? h
> is a column vector.
>
You can view matrix-vector multiplication in two different ways depending on
how you nest the loops. One way (what you are talking about) is to say that
the result is a vector of dot products. This corresponds to this pseudocode:
for (i in 1:n) {
  r[i] = M.row(i).dot(v)
}
In this approach, the inner loop is the dot product and we require row-wise
access to M.
The other way to look at it is as a weighted sum of the columns of M:
r = zero(n)
for (j in 1:m) {
  r += v[j] * M.column(j)
}
Here the inner loop is the addition to the result and we require column-wise
access to M.
This second approach can be rewritten, though, to visit only the nonzero
elements of v:
r = zero(n)
for (j in v.nonZeros()) {
  r += v[j] * M.column(j)
}
The first approach requires access to all rows of M, but does the dot
products using the sparsity pattern of v. The second approach inverts the
nesting of the loops and moves the use of the sparsity of v to the outer
loop. This allows us to access only a small number of columns of M, which
is a huge win even if M is entirely in memory. Secondarily, it also
effectively lifts the highly repeated determination of the sparsity pattern
of v out of the inner loop, resulting in additional savings.
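The two loop nestings can be made concrete as follows. This is a sketch
under the assumption that the sparse vector v is a dict of nonzero entries
and a dense list-of-lists stands in for M:

```python
def matvec_by_rows(M, v):
    """Dot-product form: one sparse dot product per row of M.
    Requires touching every row."""
    return [sum(row[j] * x for j, x in v.items()) for row in M]

def matvec_by_columns(M, v):
    """Weighted-sum form: iterate only over v's nonzeros, touching only
    the corresponding columns of M."""
    n = len(M)
    r = [0.0] * n
    for j, x in v.items():
        for i in range(n):
            r[i] += x * M[i][j]
    return r
```

Both produce the same result, but the column form reads only len(v) columns
of M, which is what makes the key-value-store-of-columns layout attractive.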
> Also why did you later say recommendation must occur online? seems
> quite doable offline and my picture of the point of this whole Hadoop
> framework is doing things offline. They've already gone to the trouble
> of running a cluster and have given up doing it entirely online, so...
>
I say this because user histories change in real time and it is common to
require real-time updates to recommendations. Since the recommendation is
so very cheap to compute, it seems reasonable to do it in real time rather
than pay the cost of storing all recommendations for all users until they
are needed. Moreover, since the recommendation is more dense than either
the history or the columns of A'A, and because many common columns of A'A
will be cached in memory, the I/O cost isn't even all that different. That
means that the total cost of just retrieving a recommendation is comparable
to the cost of computing it on the fly.

Ted Dunning, CTO
DeepDyve
