On Thu, May 5, 2011 at 8:24 AM, Vckay <darkvckay@gmail.com> wrote:
> So I am trying to build PCA. I was recommended in a previous thread that it
> was better that my data is available at the start as a distributed row
> matrix. The work flow (already posted in a previous thread) would be:
> 1. Get the data into distributed row matrix format.
> 2. Compute empirical mean vector.
>
Note that, as we've mentioned in other threads, this step:
> 3. Either subtract mean from the data
>
will turn your sparse data into dense data if you do that step with current
trunk code. When you do step 5:
> 4. Find transpose.
> 5. Multiply matrix with its transpose
>
you will be multiplying two large, dense matrices by each other, and this
will take approximately forever. Once forever has passed, you then:
> 6. Perform SVD on resultant matrix. (Lanczos SVD).
>
run SVD on a large, fully dense matrix. This will most likely complete
before the heat death of the universe, but possibly after the sun has turned
into a red giant and you need to complete it on one of Jupiter's moons, which
will have nicely thawed by that time.
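
To make the sparsity point concrete, here is a toy sketch in plain Python.
It is not Mahout code: the dict-per-row layout and the helper names
(column_means, subtract_mean, count_nonzeros) are made up for illustration.
It centers a tiny matrix with one nonzero per row and counts the stored
entries before and after:

```python
# Toy illustration only: each row is a {column: value} dict standing in for
# a sparse row vector. None of this is Mahout's API.

def column_means(rows, num_cols):
    """Empirical mean of each column, treating absent entries as zero."""
    sums = [0.0] * num_cols
    for row in rows:
        for col, value in row.items():
            sums[col] += value
    return [s / len(rows) for s in sums]

def subtract_mean(rows, means):
    """Naive centering: subtract the column mean from every cell of every row."""
    centered = []
    for row in rows:
        new_row = {}
        for col, mean in enumerate(means):
            value = row.get(col, 0.0) - mean
            if value != 0.0:  # store only nonzeros, as a sparse vector would
                new_row[col] = value
        centered.append(new_row)
    return centered

def count_nonzeros(rows):
    """Total number of stored (nonzero) entries across all rows."""
    return sum(len(row) for row in rows)

# A 4x4 matrix with a single nonzero per row.
rows = [
    {0: 4.0},
    {1: 2.0},
    {2: 6.0},
    {3: 8.0},
]
means = column_means(rows, 4)
centered = subtract_mean(rows, means)
print(count_nonzeros(rows), count_nonzeros(centered))  # prints: 4 16
```

On this toy, 4 stored entries become 16: every row picks up an entry in
every column whose mean is nonzero, which is why the product in step 5 ends
up being dense times dense.
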
jake
