mahout-user mailing list archives

From Pat Ferrel <>
Subject Re: PCA doc question for devs:
Date Thu, 06 Sep 2012 01:43:49 GMT
OK, thanks

The SSVD junit test with U*Sigma completes fine. 

On Sep 5, 2012, at 5:37 PM, Dmitriy Lyubimov <> wrote:


With SSVD you need just US (or U*Sigma in other notation), not US^-1.
This is the dimensionally reduced output of the original document
matrix that you ran with the --pca option.

As Ted suggests, you may also use US^0.5, which is already produced by
providing --uHalfSigma (or its embedded setter analog). The keys of
that output (produced by the getUPath() call) will already contain your
Text document ids as sequence file keys.
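
A small sketch of the point above, using NumPy's exact SVD as a stand-in for SSVD (this is not Mahout's API, and the document ids and counts are made up for illustration). It forms both U*Sigma and U*Sigma^0.5 from the column-centered matrix and shows that row i of the output still belongs to document i, so no transpose is needed to keep the ids:

```python
import numpy as np

def pca_project(a, k):
    """Return (U_k * Sigma_k, U_k * Sigma_k**0.5, V_k) for the column-centered matrix."""
    m = a.mean(axis=0)  # column means, the M in (A - M)
    u, s, vt = np.linalg.svd(a - m, full_matrices=False)
    u_k, s_k, v_k = u[:, :k], s[:k], vt[:k].T
    # Broadcasting scales column j of U_k by the j-th singular value.
    return u_k * s_k, u_k * np.sqrt(s_k), v_k

# Hypothetical document ids and term counts, for illustration only.
doc_ids = ["doc-a", "doc-b", "doc-c", "doc-d"]
a = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 0.0],
              [4.0, 0.0, 2.0]])

us, u_half_sigma, v_k = pca_project(a, k=2)

# Row i of U*Sigma corresponds to doc_ids[i]; the row order is unchanged.
reduced = dict(zip(doc_ids, us))

# For an exact SVD, U_k * Sigma_k matches the re-projection (A - M) V_k.
reprojected = (a - a.mean(axis=0)) @ v_k
print(np.allclose(us, reprojected))
```

With a stochastic decomposition the last comparison would hold only approximately, to roughly the accuracy of the decomposition itself.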


On Wed, Sep 5, 2012 at 5:20 PM, Pat Ferrel <> wrote:
> Trying to do dimensionality reduction with SSVD, then running the new doc matrix through kmeans.
> The Lanczos + ClusterDump test of SVD + kmeans uses A-hat = A^t V^t. Unfortunately this
> results in anonymous vectors in clusteredPoints after A-hat is run through kmeans. The doc
> ids are lost due to the transpose I assume?
> In any case Dmitriy pointed out that this might have been done because Lanczos does not
> produce U.
> So I need to do US^-1? This would avoid the transpose and should preserve doc/row ids
> for kmeans? And doing the PCA in SSVD will weight things properly so I don't need the --halfSigma?
> Please correct me if I'm wrong.
> On Sep 5, 2012, at 4:59 PM, Dmitriy Lyubimov <> wrote:
> Yes, I already have an option to output U * Sigma^0.5.
> But strictly speaking, the way PCA space is defined would require just
> U*Sigma output. Or is it not worth it?
> On Wed, Sep 5, 2012 at 4:56 PM, Ted Dunning <> wrote:
>> Yes.  (A-M)V is U \Sigma.  You may actually want something like U \sqrt
>> \Sigma instead, though.
>> On Wed, Sep 5, 2012 at 4:10 PM, Dmitriy Lyubimov <> wrote:
>>> Hello,
>>> I have a question w.r.t what to advise people in the SSVD manual for PCA.
>>> So we have
>>> (A-M) \approx U \Sigma V^t
>>> and strictly speaking, since the SVD is reduced rank, we need to re-project
>>> the original data points as
>>> Y = (A-M)V
>>> However, we can assume (A-M)V \approx U \Sigma, can't we? I.e., instead of
>>> running the expensive job of computing (A-M)V, we can just advise using
>>> U\Sigma, or just U in some cases, can't we?
>>> Thanks.
>>> -d
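
The step from (A-M)V to U \Sigma in the question above can be written out in one line. This is a sketch that assumes the columns of V are orthonormal, so that V^t V = I:

```latex
\begin{align*}
(A - M) &\approx U \Sigma V^{t} \\
Y = (A - M) V &\approx U \Sigma V^{t} V = U \Sigma
\end{align*}
```

Under that assumption, U \Sigma stands in for the re-projected points Y with no extra pass over A.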
