mahout-user mailing list archives

From Pat Ferrel <pat.fer...@gmail.com>
Subject Re: PCA doc question for devs:
Date Thu, 06 Sep 2012 01:43:49 GMT
OK, thanks

The SSVD junit test with U*Sigma completes fine. 

On Sep 5, 2012, at 5:37 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

Pat,
No,

With SSVD you need just US, not US^-1 (or U*Sigma in other notation).
This is the dimensionally reduced output of the original document
matrix you ran with the --pca option.

As Ted suggests, you may also use US^0.5, which is already produced by
providing --uHalfSigma (or its embedded setter analog). The keys of
that output (produced by the getUPath() call) will already contain your
Text document ids as sequence file keys.

-d



On Wed, Sep 5, 2012 at 5:20 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
> Trying to do dimensionality reduction with SSVD then running the new doc matrix through kmeans.
> 
> The Lanczos + ClusterDump test of SVD + kmeans uses A-hat = A^t V^t. Unfortunately this results in anonymous vectors in clusteredPoints after A-hat is run through kmeans. The doc ids are lost due to the transpose I assume?
> 
> In any case Dmitriy pointed out that this might have been done because Lanczos does not produce U.
> 
> So I need to do US^-1? This would avoid the transpose and should preserve doc/row ids for kmeans? And doing the PCA in SSVD will weight things properly so I don't need the --uHalfSigma?
> 
> Please correct me if I'm wrong.
> 
> 
> On Sep 5, 2012, at 4:59 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> 
> Yes, I have an option to output U * Sigma^0.5 already.
> 
> But strictly speaking the way PCA space is defined would require just
> U*Sigma output. Or it is not worth it?
> 
> 
> On Wed, Sep 5, 2012 at 4:56 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> Yes.  (A-M)V is U \Sigma.  You may actually want something like U \sqrt
>> \Sigma instead, though.
>> 
>> 
>> On Wed, Sep 5, 2012 at 4:10 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> 
>>> Hello,
>>> 
>>> I have a question w.r.t what to advise people in the SSVD manual for PCA.
>>> 
>>> So we have
>>> 
>>> (A-M) \approx U \Sigma V^t
>>> 
>>> and strictly speaking, since the SVD is reduced rank, we need to re-project
>>> the original data points as
>>> 
>>> Y= (A-M)V
>>> 
>>> However, we can assume (A-M)V \approx U \Sigma, can't we? I.e. instead of
>>> doing the tough job of recomputing (A-M)V we can just advise using U\Sigma,
>>> or just U in some cases, can't we?
>>> 
>>> Thanks.
>>> -d
>>> 
> 
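
The relation discussed in this thread, (A-M)V versus U \Sigma, can be
checked numerically. The following is only a minimal sketch: it uses
NumPy's exact dense SVD rather than Mahout's SSVD, and the matrix sizes,
the rank k, and all variable names are made up for illustration. With an
exact truncated SVD the two sides agree to rounding error; with a
stochastic SVD the agreement would only be approximate.

```python
import numpy as np

# Hypothetical small "document matrix": 50 rows (docs) x 20 columns (terms).
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8)) @ rng.normal(size=(8, 20))
A += 0.01 * rng.normal(size=(50, 20))

# M holds the column means, broadcast over every row (the PCA offset).
M = A.mean(axis=0, keepdims=True)

# Exact thin SVD of the centered matrix: (A - M) = U diag(s) V^t.
U, s, Vt = np.linalg.svd(A - M, full_matrices=False)

# Keep the top-k singular triplets (reduced rank).
k = 5
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k].T

# Re-projection of the original points: Y = (A - M) V.
Y = (A - M) @ V_k

# The shortcut: U * Sigma (scale each column of U by its singular value).
US = U_k * s_k

# The U * Sigma^0.5 variant mentioned in the thread.
U_half = U_k * np.sqrt(s_k)

print(np.allclose(Y, US))
```

Row i of Y, US, and U_half all correspond to row i of A, so row (document)
ids carry over to the reduced matrix without any transpose.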

