mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: problem Interpreting SVD values
Date Sun, 18 Oct 2009 17:28:10 GMT
I have not worked with lingpipe, but ...

When I follow the steps you are taking using R, I get this:

> docs=data.frame(d0=c(2,2,0,0), d1=c(2,2,0,0), d2=c(0,0,2,2), row.names=c("t0","t1","t2","t3"))
> docs
   d0 d1 d2
t0  2  2  0
t1  2  2  0
t2  0  0  2
t3  0  0  2
> svd(docs)
$d
[1] 4.000000 2.828427 0.000000

$u
           [,1]       [,2]       [,3]
[1,] -0.7071068  0.0000000 -0.7071068
[2,] -0.7071068  0.0000000  0.7071068
[3,]  0.0000000 -0.7071068  0.0000000
[4,]  0.0000000 -0.7071068  0.0000000

$v
           [,1] [,2]       [,3]
[1,] -0.7071068    0 -0.7071068
[2,] -0.7071068    0  0.7071068
[3,]  0.0000000   -1  0.0000000

Note that my document matrix looks quite different from yours, but that is
simply because we are using different representations.  You have lines of
triples containing document number, term number and count; I have the
resulting matrix.
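
As a rough sketch of how the two representations relate, the triples from the quoted message can be expanded into the term x document matrix used above. This assumes the triples are (document, term, count) with zero-based ids; the names `triples` and `m` are just illustrative.

```r
# Triples copied from the quoted message, read as (document, term, count).
triples <- data.frame(
  doc   = c(0, 0, 1, 1, 2, 2),
  term  = c(0, 1, 0, 1, 2, 3),
  count = c(2, 2, 2, 2, 2, 2)
)

# Start from an all-zero 4 terms x 3 documents matrix.
m <- matrix(0, nrow = 4, ncol = 3,
            dimnames = list(paste0("t", 0:3), paste0("d", 0:2)))

# R indexes from 1, so shift the zero-based ids by one.
# Indexing with a two-column matrix sets one (row, column) cell per triple.
m[cbind(triples$term + 1, triples$doc + 1)] <- triples$count
m
```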

As far as my results are concerned, the diagonal component of the svd
(labeled $d above) clearly shows that there are only 2 non-zero singular
values.  This means that the first two columns of u and v are the only ones
necessary for reconstructing my docs matrix.  The third vector in each
corresponds to the zero singular value and so lies in a null space of the
document matrix (the right null space for v, the left one for u).
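
A quick way to check this claim, sketched under the assumption that a tolerance of 1e-10 is a reasonable cutoff for "zero" here (the names `tol`, `k` and `rebuilt` are illustrative):

```r
# Term x document matrix from above (matrix() fills column by column).
docs <- matrix(c(2, 2, 0, 0,
                 2, 2, 0, 0,
                 0, 0, 2, 2),
               nrow = 4,
               dimnames = list(paste0("t", 0:3), paste0("d", 0:2)))
s <- svd(docs)

# Count the singular values that are not numerically zero.
tol <- 1e-10
k <- sum(s$d > tol)

# Rebuild docs from the first k singular triplets only.
rebuilt <- s$u[, 1:k, drop = FALSE] %*%
  diag(s$d[1:k], k, k) %*%
  t(s$v[, 1:k, drop = FALSE])

# All three quantities should be zero up to floating-point error.
max(abs(docs - rebuilt))         # reconstruction error
max(abs(docs %*% s$v[, 3]))      # third column of v is in the null space
max(abs(t(docs) %*% s$u[, 3]))   # third column of u is in the left null space
```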

Moreover, if you look at the first two columns of my u matrix, you see a
representation that shows that documents tend to contain t0 and t1 in equal
number, or t2 and t3 in equal number, but they don't tend to contain any
other pattern.  Singular vectors are not normally so easy to interpret.
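
To see what this means for the documents themselves, one possible sketch is to place each document in the two-dimensional space of the non-zero singular values. Scaling the columns of v by the singular values is one common convention and an assumption here, not something prescribed above; `doc_coords` is an illustrative name.

```r
# Same term x document matrix as above.
docs <- matrix(c(2, 2, 0, 0,
                 2, 2, 0, 0,
                 0, 0, 2, 2),
               nrow = 4,
               dimnames = list(paste0("t", 0:3), paste0("d", 0:2)))
s <- svd(docs)

# One row per document, one column per retained singular vector.
doc_coords <- s$v[, 1:2] %*% diag(s$d[1:2])
rownames(doc_coords) <- colnames(docs)
doc_coords

# Pairwise Euclidean distances: d0 and d1 coincide, d2 sits apart from both.
dist(doc_coords)
```

The signs of the coordinates depend on the SVD routine, so only the distances should be relied on.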

For reference, I normally prefer document x term matrices.  Here is that
form of the computation:

> docs=data.frame(t0=c(2,2,0), t1=c(2,2,0), t2=c(0,0,2), t3=c(0,0,2), row.names=c("d0","d1","d2"))
> docs
   t0 t1 t2 t3
d0  2  2  0  0
d1  2  2  0  0
d2  0  0  2  2
> svd(docs)
$d
[1] 4.000000 2.828427 0.000000

$u
          [,1] [,2]       [,3]
[1,] 0.7071068    0 -0.7071068
[2,] 0.7071068    0  0.7071068
[3,] 0.0000000    1  0.0000000

$v
          [,1]      [,2]       [,3]
[1,] 0.7071068 0.0000000 -0.7071068
[2,] 0.7071068 0.0000000  0.7071068
[3,] 0.0000000 0.7071068  0.0000000
[4,] 0.0000000 0.7071068  0.0000000

The results are the same, of course, with u and v trading places and the
signs of some vectors flipped.
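
A small check of that equivalence, as a sketch: only the columns belonging to non-zero singular values are compared, and only up to sign, since the remaining columns are not uniquely determined. The names `a` and `b` are illustrative.

```r
# Term x document form; its transpose is the document x term form.
term_doc <- matrix(c(2, 2, 0, 0,
                     2, 2, 0, 0,
                     0, 0, 2, 2),
                   nrow = 4)
a <- svd(term_doc)
b <- svd(t(term_doc))

# The singular values agree.
all.equal(a$d, b$d)

# u of one form matches v of the other, up to the sign of each column.
all.equal(abs(a$u[, 1:2]), abs(b$v[, 1:2]))
all.equal(abs(a$v[, 1:2]), abs(b$u[, 1:2]))
```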

On Sun, Oct 18, 2009 at 12:46 AM, prasenjit mukherjee
<prasen.bea@gmail.com> wrote:

> Apologies, as I know the question is actually for lingpipe, but was
> hoping if I could get some response from mahout users as well ( who
> has probably worked with  lingpipe )
>
>
> ---------- Forwarded message ----------
> From: prasenjit mukherjee <prasen.bea@gmail.com>
> Date: Sun, Oct 18, 2009 at 12:39 PM
> Subject: problem Interpreting SVD values
> To: lingpipe <lingpipe@yahoogroups.com>
>
>
> I am trying to evaluate  partialSvd() on a smaller matrix and this is
> what my findings are. Below is my input matrix, assuming 4 terms and 3
> docs.
>
> doc0 => (2,t0) (2,t1)
> doc1 => (2,t0) (2,t1)
> doc2 => (2,t2) (2,t3)
>
> As one can see docs d0,d1 are exactly same containing 4 terms  with 2
> from t0,t1 each.  3rd doc is different containing 4 terms with 2 from
> t2,t3 each. Below is their matrix representation  ( in TXD form ) :
>
> 0,0,2
> 0,1,2
> 1,0,2
> 1,1,2
> 2,2,2
> 2,3,2
>
> I ran with maxOrder =2 and following input  params :
>        double featureInit = 0.01;
>        double initialLearningRate = 0.005;
>        int annealingRate = 1000;
>        double regularization = 0.00;
>        double minImprovement = 0.0001;
>        int minEpochs = 2;
>        int maxEpochs = 100;//50000;
> and was expecting to get d0,d1 in 1 cluster and d2 in another.
> Contrary to my expectation I am getting the following output ( See U,V
> values) :
>
>     [java]       :00 Start
>     [java]       :00   Factor=0
>     [java]       :00     epoch=0 rmse=1.9999848100360043
>     [java]       :00     epoch=1 rmse=1.9999835637692873
>     [java]       :00     epoch=2 rmse=1.999982296871324
>     [java]       :00 Converged in epoch=2 rmse=1.999982296871324
> relDiff=3.167271940722782E-7
>     [java]       :00 Order=0 RMSE=1.9999835637692873
>     [java]       :00   Factor=1
>     [java]       :00     epoch=0 rmse=1.9999522133829444
>     [java]       :00     epoch=1 rmse=1.9999506819096369
>     [java]       :00     epoch=2 rmse=1.99994912138043
>     [java]       :00 Converged in epoch=2 rmse=1.99994912138043
> relDiff=3.901420744799641E-7
>     [java]       :00 Order=1 RMSE=1.9999506819096369
>     [java] SVD Computation Done. Singular Values:
>     [java]     2.796903874825226E-4  2.536844759290206E-4
>     [java] Output U_Matrix: ./rundir/U_out.matrix
>     [java] Output V_Matrix: ./rundir/V_out.matrix
>
>
> And my U,V matrices are :
> U:
> 0,0,-0.690807182791581
> 0,1,0.6535363126818338
> 1,0,0.053924014251416
> 1,1,-0.2055548955329534
> 2,0,-0.7210254065499858
> 2,1,0.7284486755624372
>
> Shouldn't the coeffs of 0 and 1s be the same in U, because they refer
> to d0 and d1  ?
>
> V:
> 0,0,-0.7473523845369358
> 0,1,-0.14168050325102471
> 1,0,0.35114591804331297
> 1,1,0.6137947267695599
> 2,0,-0.4945242525093567
> 2,1,0.776371576839163
> 3,0,0.27130558646761577
> 3,1,0.02073265696164584
>



-- 
Ted Dunning, CTO
DeepDyve
