On Wed, Jun 9, 2010 at 12:08 PM, Richard Simon Just <
info@richardsimonjust.co.uk> wrote:
>
> Agreed. Just wanted to answer the question that had been left hanging: why
> Sarwar adds the row average back. In fact, to be complete, before the SVD
> they fill each null value with 'column average - row average'. But yeah,
> that would make for a much bigger computation.
>
I should note here that if you really want to see the difference between
the fully centered SVD and the one where you only shift the nonzero entries,
the former is not *necessarily* a dense and therefore out-of-reach
computation:
The meat of the Lanczos algorithm is repeatedly computing A . v, where A is
your big big (N x M) matrix, and v is a (dense) vector which is getting
closer and closer to the top singular vector. If you wanted to compute the
singular vectors of (A - B), where B_ij = r_i - c_j is the matrix of row
averages minus column averages, with r_i = (1/M) sum_{j=1..M} A_ij and
c_j = (1/N) sum_{k=1..N} A_kj, then you can just do it in a couple of
(still sparse) steps:
[(A - B) . v]_i = [A . v]_i - [B . v]_i
                = [A . v]_i - r_i (sum_{j=1..M} v_j) + (sum_{j=1..M} c_j v_j)
                = [A . v]_i - r_i * v.zSum() + (1/N) * (A . v).zSum()
The second term is just the row average of row i times v.zSum(), and the
last term is the same constant for every entry of the output vector: just
(1/N) * (A . v).zSum().
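For anyone who wants to sanity-check that algebra numerically, here is a
small NumPy/SciPy sketch (not Mahout code; the matrix size and density are
arbitrary) comparing a dense (A - B) . v against the sparse-only version:

```python
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(0)
N, M = 50, 30                              # arbitrary demo dimensions
A = sparse_random(N, M, density=0.1, random_state=rng, format="csr")

r = np.asarray(A.mean(axis=1)).ravel()     # row averages r_i = (1/M) sum_j A_ij
c = np.asarray(A.mean(axis=0)).ravel()     # column averages c_j = (1/N) sum_k A_kj
v = rng.standard_normal(M)

# Dense reference: materialize B and compute (A - B) . v directly
B = r[:, None] - c[None, :]
dense = A.toarray() @ v - B @ v

# Sparse trick: one sparse matvec, then two cheap scalar corrections
Av = A @ v
trick = Av - r * v.sum() + Av.sum() / N

assert np.allclose(dense, trick)
```

The trick never forms B, so the per-iteration cost stays proportional to
the number of nonzeros in A.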
So no additional MapReduce passes are needed to incorporate this - it's just
that the Lanczos solver will need to be told the row and column averages of
the input and, optionally, at each Lanczos iteration, modify the resultant
vector as above.
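In SciPy terms (again, just an illustrative sketch, not the Mahout solver),
the same idea can be packaged as a LinearOperator handed to an iterative
SVD routine, so the centered matrix is never materialized; the transpose
product needed by the solver follows the same derivation, since
[B^T . u]_j = (r . u) - c_j * u.zSum():

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import LinearOperator, svds

rng = np.random.default_rng(1)
N, M = 200, 120                            # arbitrary demo dimensions
A = sparse_random(N, M, density=0.05, random_state=rng, format="csr")
r = np.asarray(A.mean(axis=1)).ravel()     # row averages r_i
c = np.asarray(A.mean(axis=0)).ravel()     # column averages c_j

def matvec(v):
    # (A - B) . v without forming B: one sparse matvec + two corrections
    v = np.asarray(v).ravel()
    Av = A @ v
    return Av - r * v.sum() + Av.sum() / N

def rmatvec(u):
    # (A - B)^T . u, same trick: [B^T u]_j = (r . u) - c_j * sum(u)
    u = np.asarray(u).ravel()
    return A.T @ u - r @ u + c * u.sum()

op = LinearOperator((N, M), matvec=matvec, rmatvec=rmatvec)
s_implicit = svds(op, k=5, return_singular_vectors=False)

# Dense reference, forming B explicitly only to check the result
B = r[:, None] - c[None, :]
s_dense = np.linalg.svd(A.toarray() - B, compute_uv=False)[:5]
assert np.allclose(np.sort(s_implicit), np.sort(s_dense))
```

This mirrors the suggestion above: the solver only needs the row and column
averages up front, and each iteration applies the cheap corrections to the
sparse matvec's output.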
See if it makes a difference!
jake