mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Question Regarding Distributed Row Matrix
Date Thu, 05 May 2011 17:22:37 GMT
On Thu, May 5, 2011 at 8:24 AM, Vckay <darkvckay@gmail.com> wrote:

> So I am trying to build PCA. I was recommended in a previous thread that it
> was better that my data is available at the start as a distributed row
> matrix. The work flow (already posted in a previous thread) would be:
> 1. Get the data into distributed row matrix format.
> 2. Compute empirical mean vector.
>

Note that as we've mentioned in other threads, this step:


> 3. Either subtract mean from the data
>

will turn your sparse data into dense, if you do that step with current
trunk code.  When you do step 5:


> 4. Find transpose.
> 5. Multiply matrix with its transpose
>

you will be multiplying two large, dense matrices by each other, and this
will take approximately forever.  Once forever has passed, you then:


> 6. Perform SVD on resultant matrix. (Lanczos SVD).
>

run SVD on a large, fully dense matrix.  This will most likely complete
before the heat death of the universe, but possibly after the sun has
turned into a red giant and you need to complete it on one of Jupiter's
moons, which will have nicely thawed by that time.
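
To make the densification concrete, here is a small sketch in plain Python
(not Mahout code; the toy matrix and every function name are made up for
illustration).  It counts stored entries before and after explicit mean
subtraction, then checks the standard identity
(A - 1 m^T)^T (A - 1 m^T) = A^T A - n m m^T, which yields the same centered
product from the sparse rows alone.  Whether that identity fits a particular
Mahout job is an assumption, not something established in this thread.

```python
# Toy illustration only: rows are sparse {column_index: value} dicts.
# All names here are hypothetical and unrelated to the Mahout API.

def column_means(rows, n_cols):
    """Mean of each column, touching only the stored nonzeros."""
    sums = [0.0] * n_cols
    for row in rows:
        for j, v in row.items():
            sums[j] += v
    return [s / len(rows) for s in sums]


def center_dense(rows, means):
    """Explicit mean subtraction: every entry, including former zeros, is stored."""
    return [[row.get(j, 0.0) - m for j, m in enumerate(means)] for row in rows]


def gram_sparse(rows, n_cols):
    """A^T A, accumulated from the stored nonzeros of each row."""
    g = [[0.0] * n_cols for _ in range(n_cols)]
    for row in rows:
        for j, vj in row.items():
            for k, vk in row.items():
                g[j][k] += vj * vk
    return g


def gram_dense(dense_rows, n_cols):
    """B^T B for a dense matrix B, touching every entry of every row."""
    g = [[0.0] * n_cols for _ in range(n_cols)]
    for row in dense_rows:
        for j in range(n_cols):
            for k in range(n_cols):
                g[j][k] += row[j] * row[k]
    return g


def centered_gram_without_centering(rows, n_cols):
    """Centered Gram matrix via A^T A - n * m m^T, so A itself stays sparse."""
    n = len(rows)
    m = column_means(rows, n_cols)
    g = gram_sparse(rows, n_cols)
    return [[g[j][k] - n * m[j] * m[k] for k in range(n_cols)]
            for j in range(n_cols)]


N_COLS = 5
rows = [{0: 1.0}, {2: 2.0}, {1: 3.0, 4: 1.0}, {3: 4.0}]

means = column_means(rows, N_COLS)
centered = center_dense(rows, means)

stored_before = sum(len(row) for row in rows)
stored_after = sum(1 for row in centered for v in row if v != 0.0)
print(stored_before, stored_after)  # 5 20

explicit = gram_dense(centered, N_COLS)
implicit = centered_gram_without_centering(rows, N_COLS)
max_diff = max(abs(explicit[j][k] - implicit[j][k])
               for j in range(N_COLS) for k in range(N_COLS))
print(max_diff < 1e-9)  # True
```

On this 4 x 5 example, 5 stored entries become 20 after centering, i.e. the
whole matrix.  The correction term n m m^T is itself a dense square matrix
with one row and column per original column, so this only avoids densifying
the input; it does not make the product sparse.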

  -jake
