mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: SSVD Wrong Singular Vectors
Date Sun, 02 Sep 2012 05:10:47 GMT
On Sun, Sep 2, 2012 at 12:26 AM, Ahmed Elgohary <aagohary@gmail.com> wrote:

> - I am using k = 30 and p = 2 so (k+p)<99 (Rank(A))
> - I am attaching the csv file of the matrix A
>

Brilliant.  And the attachment actually made it through.


> - yes, the difference is significant. Here is the output of the sequential
> SSVD:
>

Indeed.  The difference in the sequential case looks very big.

Is it possible that you have nearly identical singular values?  This could
leave the singular vectors under-determined.

I just looked at this using R.  Since it uses LAPACK just like Matlab, the
singular vectors I got are very similar to yours.

> s$u[1,1:10]
 [1] -0.09866006 -0.13195226  0.16620985  0.04921311 -0.18280221
 0.11559161 -0.06775417  0.05038383
 [9] -0.01597058  0.03498753


But look at the spectrum:

> s$d[1:30]
 [1] 49.944662  5.396441  5.363498  5.162772  5.124997  5.031488  4.952310
 4.837590  4.789828  4.718988
[11]  4.673337  4.576722  4.489299  4.464587  4.423230  4.349272  4.233963
 4.069206  4.061468  4.028808
[21]  3.977288  3.927101  3.867809  3.771824  3.715605  3.646800  3.522127
 3.494618  3.446958  3.412861


After the first singular value, you have lots of nearly identical values.
 This indicates that you probably need to use a power method to
pre-condition the data.

In a large-scale application, it is typically a much longer ways before you
hit this flat plateau.  This makes the SSVD much more economical.

Also, if I step through the basic algorithm using R, it is easy to see that
the basic assumptions of the stochastic projection are violated.

> Y = A %*% matrix(rnorm(3200), 100, 32)
> qr1 = qr(Y)
> norm(qr.Q(qr1) %*% t(qr.Q(qr1)) %*% A - A)
[1] 20.83509
> norm(A)
[1] 58.34794


The issue here is that the random projection is not a very good
approximation of A.  THis happens if the singular value spectrum is still
large at the cutoff:

> s$d[30:99]
 [1] 3.41286101 3.38844993 3.32769425 3.27354388 3.20851301 3.16175529
3.08556715 3.00915717 2.95061044
[10] 2.92376037 2.87917470 2.82316676 2.75040129 2.66700067 2.61057954
2.58533645 2.53817660 2.49961198
[19] 2.45114845 2.37234167 2.32895166 2.28347210 2.25993598 2.25411242
2.20475212 2.14491854 2.06434148
[28] 2.05223478 1.98081081 1.94691237 1.92036956 1.83571526 1.75956488
1.69612907 1.67369404 1.58851679
[37] 1.54144415 1.51821081 1.45912647 1.45063011 1.41732925 1.35655483
1.31333389 1.27988382 1.18840667
[46] 1.14900347 1.10145851 1.08105797 1.04054897 0.96239289 0.94764227
0.86479263 0.82724565 0.79683667
[55] 0.71886437 0.68518593 0.63232310 0.57656816 0.57076929 0.51649318
0.43080840 0.39319166 0.37404507
[64] 0.34126503 0.31046386 0.25922183 0.17239248 0.10925979 0.07458005
0.02792311


So this is the basic problem.

If we do even just one or two power iterations things look vats better

> norm(qr.Q(qr1) %*% t(qr.Q(qr1)) %*% A - A)/norm(A)
[1] 0.3570835
> A1 = A %*% t(A) %*% A
> Y1 = A1 %*% matrix(rnorm(3200), 100, 32)
> qr1.1 = qr(A1)
> norm(qr.Q(qr1.1) %*% t(qr.Q(qr1.1)) %*% A1 - A1)/norm(A1)
[1] 6.076296e-16
> qr1.1 = qr(Y1)
> norm(qr.Q(qr1.1) %*% t(qr.Q(qr1.1)) %*% A1 - A1)/norm(A1)
[1] 0.00156115
> A2 = A %*% t(A) %*% A %*% t(A) %*% A
> Y2 = A1 %*% matrix(rnorm(3200), 100, 32)
> Y2 = A2 %*% matrix(rnorm(3200), 100, 32)
> qr1.2 = qr(Y2)
> norm(qr.Q(qr1.2) %*% t(qr.Q(qr1.2)) %*% A2 - A2)/norm(A2)
[1] 8.318334e-06


But this is just going to help find the larger singular vectors... the
reconstruction of A is still going to be problematic.

For a more interesting test of the algorithm within the domain it is liable
to work, I would recommend starting with a definition of A that has known
singular values.  Pick U and V at random, engineer S as desired and then
try this again.

And remember that SSVD isn't intended for small problems.


> I am not sure what you mean:
> "Did you account for the fact that your matrix is small enough that it
> probably wasn't divided correctly?"
>

I was worried that data might not get well split between mappers.  That
isn't your problem.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message