I think that the question here is how to generate the A' A product.
If so, then answer is to read one row of A at a time and emit all of the
product pairs that could happen from that row. The key is the row and
column in the result matrix.
More precisely:
map: i, A[i,*] => foreach j foreach k yield ((j,k), A[i,j] * A[i,k])
combine and reduce: (j,k), values => (j, k), sum(values)
The output is A'A in triple form. You might be able to do a trick with
sorting and just use j as the key. Then you could emit an entire row at a
time.
A second MR needs to produce row sums of A'A.
This product may require downsampling of nonsparse rows of A because it is
quadratic in the densest row count. I usually just use random sampling for
any row with more than several hundred entries. This is defensible because
little additional information is gained after the first hundred or so
entries.
On Thu, Dec 3, 2009 at 7:02 AM, Sean Owen <srowen@gmail.com> wrote:
> One question that maybe Ted already knows the answer to: how would I
> iterate over A_u* for each A_u*? the mapper would touch only one each.
> I guess I could use a whole MR step to generate the cartesian product
> so to speak.
>

Ted Dunning, CTO
DeepDyve
