Pat,
No,
With SSVD you need just US (or U*Sigma in other notation), not US^1.
This is the dimensionally reduced output for the original document
matrix you ran with the pca option.
As Ted suggests, you may also use US^0.5, which is already produced by
setting uHalfSigma (or its embedded setter analog). The keys of
that output (at the path returned by the getUPath() call) will already
contain your Text document ids as sequence file keys.
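
For illustration only, here is a small NumPy sketch (not Mahout code) of why
US can stand in for the reprojection (AM)V. It assumes that AM means the
mean-centered input and uses an exact truncated SVD on toy data, so the two
agree exactly; with SSVD the match would only be approximate. All variable
names are made up for the example.

```python
import numpy as np

# Toy "document x term" matrix: 20 documents, 8 terms (random data, illustration only).
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 8))

# Assumption: "AM" in the thread is the column-mean-centered input.
AM = A - A.mean(axis=0)

# Exact thin SVD, truncated to rank k. SSVD would give an approximation of these factors.
k = 3
U, s, Vt = np.linalg.svd(AM, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k].T

Y = AM @ V_k                  # explicit reprojection (AM)V
US = U_k * s_k                # U * Sigma (scales column j of U by sigma_j)
US_half = U_k * np.sqrt(s_k)  # U * Sigma^0.5, the alternative Ted mentions

# Row i of Y, US and US_half all describe document i, so row ids carry over.
print(np.allclose(Y, US))
```

Row order is untouched in all three outputs, which is why the document ids
survive when U-based output is fed to kmeans.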
d
On Wed, Sep 5, 2012 at 5:20 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
> Trying to do dimensionality reduction with SSVD then running the new doc matrix through
> kmeans.
>
> The Lanczos + ClusterDump test of SVD + kmeans uses Ahat = A^t V^t. Unfortunately this
> results in anonymous vectors in clusteredPoints after Ahat is run through kmeans. The doc
> ids are lost due to the transpose I assume?
>
> In any case Dmitriy pointed out that this might have been done because Lanczos does not
> produce U.
>
> So I need to do US^1? This would avoid the transpose and should preserve doc/row ids
> for kmeans? And doing the PCA in SSVD will weight things properly so I don't need the halfSigma?
>
> Please correct me if I'm wrong.
>
>
> On Sep 5, 2012, at 4:59 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
> Yes, I have an option to output U * Sigma^0.5 already.
>
> But strictly speaking the way PCA space is defined would require just
> U*Sigma output. Or is it not worth it?
>
>
> On Wed, Sep 5, 2012 at 4:56 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> Yes. (AM)V is U \Sigma. You may actually want something like U \sqrt
>> \Sigma instead, though.
>>
>>
>> On Wed, Sep 5, 2012 at 4:10 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have a question w.r.t what to advise people in the SSVD manual for PCA.
>>>
>>> So we have
>>>
>>> (AM) \approx U \Sigma V^t
>>>
>>> and strictly speaking, since the SVD is reduced rank, we need to reproject
>>> the original data points as
>>>
>>> Y= (AM)V
>>>
>>> However we can assume (AM)V \approx U \Sigma, can't we? I.e. instead of
>>> doing the tough job of recomputing (AM)V, we can just advise using U \Sigma,
>>> or just U in some cases, can't we?
>>>
>>> Thanks.
>>> d
>>>
>
