I'm going ahead and implementing something like this as a simple but
real distributed version.
One question that maybe Ted already knows the answer to: how would I
iterate over every A_u* for each A_u*? Each mapper would touch only one
of them. I guess I could use a whole MR step to generate the Cartesian
product, so to speak.
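To make the question concrete, here is a toy in-memory sketch of that Cartesian-product idea (the vectors and names are made up for illustration; a real MR pass would replicate each record to every reducer key rather than hold everything in memory):

```python
from itertools import product

# Toy user-item vectors; each value plays the role of one A_u* record.
vectors = {'u1': ['a', 'b'], 'u2': ['b', 'c']}

def cross_product(vecs):
    # Yield every ordered pair (A_u*, A_v*) of user vectors, which is
    # what a self-join MR step would have to materialize.
    for (u, au), (v, av) in product(vecs.items(), repeat=2):
        yield (u, v), (au, av)

pairs = dict(cross_product(vectors))
```

The pair count grows as the square of the number of users, which is exactly why the inside-out formulation below avoids ever holding two full vectors at once.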
On Thu, Sep 10, 2009 at 1:44 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Another implementation is to iterate over users emitting cooccurrence
> pairs. These can be sorted on the items involved and the counts
> aggregated. This is ideal for map reduce implementation.
>
> Roughly the matrix approach becomes this:
>
> Map:
> input record is A_u*, all items for a particular user u
> for each item i in A_u*
> yield ((i,*), 1)
> for each item j in A_u*
> yield ((i, j), 1)
>
> Reduce:
> input record is (i,j), (k1, k2, ... kn)
> yield (i,j), sum k*
>
> or
>
> input record is (i,*), (k1, k2, ... ,kn)
> yield (i,*), sum k*
>
> This computes the elements of A'A (item cooccurrence counts) and the row
> sums of A'A in one step. You might want to compute the column sums of A
> first to use these for weighting A before using this program.
>
> Then you need another MR step to bring together the overall total, the row
> counts and the cell counts for final reduction.
>
> This is inside out compared with the most obvious implementation and there
> is no place where we really have two entire item vectors. This will make it
> rather harder to have a totally simple pluggable weighting scheme that runs
> as fast.
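Ted's map and reduce steps can be simulated in a few lines of plain Python (this is just an in-memory sketch with made-up toy data, not Hadoop code; the function names are my own):

```python
from collections import defaultdict

def mapper(user_items):
    # Input record is A_u*: all items for one user u.
    for i in user_items:
        yield (i, '*'), 1      # contributes to the row sum for item i
        for j in user_items:
            yield (i, j), 1    # cooccurrence count, an entry of A'A

def reduce_counts(emitted):
    # Stand-in for the shuffle/sort + reduce: sum counts per key.
    counts = defaultdict(int)
    for key, k in emitted:
        counts[key] += k
    return dict(counts)

users = [['a', 'b'], ['b', 'c'], ['a', 'b', 'c']]
emitted = [kv for u in users for kv in mapper(u)]
counts = reduce_counts(emitted)
# counts[('a', 'b')] == 2: items a and b cooccur for two users;
# counts[('b', '*')] == 3: the row sum for item b.
```

This gives both the cooccurrence counts and the row sums in one pass over the user records, matching the pseudocode above; only the second MR step that combines the totals for the final weighting is left out.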
