mahout-user mailing list archives

From: Dmitriy Lyubimov <dlie...@gmail.com>
Subject: Re: Doing dimensionality reduction with SSVD and Lanczos
Date: Thu, 06 Sep 2012 20:01:25 GMT
On Thu, Sep 6, 2012 at 10:17 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
> When using Lanczos the recommendation is to use the clean eigenvectors as a distributed row matrix -- call it V.
>
> A-hat = A^t V^t, per the clusterdump tests DSVD and DSVD2.

I am not quite sure where this comes from. (For one, A is m x n and V
is n x k, so the product A^t V^t does not even exist: A^t is n x m and
V^t is k x n, and the inner dimensions m and k don't match in general.)
PCA space is defined as AV, which is m x k -- one reduced row per
original row of A.


>
> Dmitriy and Ted recommend when using SSVD to do:
>
> A-hat = US

Yes, since PCA space = AV, and AV is approximately the same as U*Sigma.

SSVD already produces U and Sigma (or U*Sigma^0.5 as a single output
product), so there's no need to compute the AV product again.

>
> When using PCA it's also preferable to use --uHalfSigma to create U with the SSVD solver. One difficulty is that to perform the multiplication you have to turn the singular values vector (diagonal values) into a distributed row matrix or write your own multiply function, correct?

You could do that, but why? Sigma is a diagonal matrix (which is
additionally encoded as a very short vector of singular values of
length k; say we denote it 'sv'). Given that, there's absolutely no
reason to encode it as a distributed row matrix.

Multiplication can be done on the fly as you load U, row by row:
(U*Sigma)[i,j] = U[i,j] * sv[j]
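
Just to illustrate the idea (a minimal sketch in plain Java, not the
actual Mahout loading code; in the real pipeline each row of U comes
out of the sequence file under getUPath() keyed by its Text document
id, and sv is the vector of singular values the solver computed):

public class ScaleUBySigma {

  // On-the-fly scaling of one row of U: (U*Sigma)[i,j] = U[i,j] * sv[j]
  static double[] scaleRow(double[] uRow, double[] sv) {
    double[] reduced = new double[uRow.length];
    for (int j = 0; j < uRow.length; j++) {
      reduced[j] = uRow[j] * sv[j];
    }
    return reduced;
  }

  public static void main(String[] args) {
    // toy numbers: one row of U and k = 3 singular values
    double[] uRow = {0.1, -0.2, 0.05};
    double[] sv = {12.0, 3.5, 1.1};
    System.out.println(java.util.Arrays.toString(scaleRow(uRow, sv)));
    // prints roughly [1.2, -0.7, 0.055]
  }
}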

One inconvenience with that approach is that it assumes you can freely
hack the code that loads the U matrix for further processing.

It is much easier to have SSVD output U*Sigma directly using the same
logic as above (requires a patch), or just have it output U*Sigma^0.5
(does not require a patch).
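
(And if you take the --uHalfSigma route, you can still get to U*Sigma
later by scaling each column by sv[j]^0.5 once more, since
(U*Sigma^0.5)[i,j] * sv[j]^0.5 = U[i,j] * sv[j].)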

You could even use U directly in some cases, but part of the problem
is that data variances will be normalized in all directions compared
to PCA space, which will affect actual distances between data points.
If you want to retain the proportions of the directional variances as
in your original input, you need to use the principal components with
scaling applied, i.e. U*Sigma.
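
To make that concrete with a made-up example: if sv = (10, 1) and a
row of U is (0.3, 0.4), the corresponding row of U*Sigma is (3.0, 0.4).
In U*Sigma space, differences along the first principal direction weigh
ten times as much in Euclidean distance, just as they do in the original
data; in plain U space both directions are weighted equally.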

>
> Questions:
> For SSVD can someone explain why US is preferred? Given A = USV^t how can you ignore the effect of V^t? Is this only for PCA? In other words, if you did not use PCA weighting would you ignore V^t?


US is not preferred (preferred over what?). It is the approximation of
the definition of PCA, which is (A-M)V (assuming row-wise data points)
or U^t(A-M) (assuming columnar data points). Since SSVD --pca deals
specifically with row-wise data points for PCA purposes (it uses
column mean subtraction), what we really have is the PCA equivalent
(A-M)V. Since (A-M) approx. = U*Sigma*V^t, it follows that
(A-M)V approx. = U*Sigma.
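
Spelled out, using the fact that V has orthonormal columns (V^t V = I):

(A-M)V  approx.=  (U*Sigma*V^t) V  =  U*Sigma*(V^t V)  =  U*Sigma

So V^t is not really being ignored; it simply cancels when you project
(A-M) back onto V.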

The rest of this question I don't understand.

> For Lanczos A-hat = A^t V^t seems to strip doc ids during transpose, am I mistaken? Also shouldn't A-hat be transposed before performing kmeans or other analysis?
>
>
>
>> Dmitriy said
> With SSVD you need just US  (or U*Sigma in other notation).
> This is your dimensionally reduced output of your original document
> matrix you've run with --pca option.
>
> As Ted suggests, you may also use US^0.5 which is already produced by
> providing --uHalfSigma (or its embedded setter analog). the keys of
> that output (produced by getUPath() call) will already contain your
> Text document ids as sequence file keys.
>
> -d
>
>
