mahout-user mailing list archives

From Jake Mannix <>
Subject Re: Question Regarding Distributed Row Matrix
Date Thu, 05 May 2011 17:22:37 GMT
On Thu, May 5, 2011 at 8:24 AM, Vckay wrote:

> So I am trying to build PCA. I was recommended in a previous thread that it
> was better that my data is available at the start as a distributed row
> matrix. The work flow (already posted in a previous thread) would be:
> 1. Get the data into distributed row matrix format.
> 2. Compute empirical mean vector.

Note that as we've mentioned in other threads, this step:

> 3. Either subtract mean from the data

will turn your sparse data into dense data, if you do that step with the
current code.  When you do step 5:

> 4. Find transpose.
> 5. Multiply matrix with its transpose

You will be multiplying two large, dense matrices by each other, and this
will take approximately forever.  Once forever has passed, you then:

> 6. Perform SVD on resultant matrix. (Lanczos SVD).

run SVD on a large, fully dense matrix.  This will most likely complete
after the heat death of the universe, but possibly only after the sun has
turned into a red giant and you need to complete it on one of Jupiter's
moons, which will have nicely thawed by that time.
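
To make the densification in step 3 concrete, here is a minimal sketch in
plain Python. It does not use Mahout or its DistributedRowMatrix API; the
dict-of-columns row format, the helper names (column_means, subtract_mean,
nnz), and the matrix sizes are all assumptions made up for illustration. It
only counts how many entries are stored before and after subtracting the
column means.

```python
import random

# Hypothetical sparse row format: {column_index: value}, zeros not stored.

def column_means(rows, n_cols):
    """Empirical mean of each column, treating missing entries as 0."""
    totals = [0.0] * n_cols
    for row in rows:
        for j, v in row.items():
            totals[j] += v
    return [t / len(rows) for t in totals]

def subtract_mean(rows, means):
    """Naive centering: subtract the column mean from every entry of every row."""
    centered = []
    for row in rows:
        out = {}
        for j, m in enumerate(means):
            v = row.get(j, 0.0) - m
            if v != 0.0:  # store only entries that are still non-zero
                out[j] = v
        centered.append(out)
    return centered

def nnz(rows):
    """Total number of stored (non-zero) entries."""
    return sum(len(r) for r in rows)

random.seed(0)
n_rows, n_cols, per_row = 200, 1000, 5  # illustrative sizes only
rows = [{j: 1.0 for j in random.sample(range(n_cols), per_row)}
        for _ in range(n_rows)]

means = column_means(rows, n_cols)
centered = subtract_mean(rows, means)

print("non-zeros before centering:", nnz(rows))
print("non-zeros after centering: ", nnz(centered))
```

Every column that has at least one non-zero entry gets a non-zero mean, so
after the subtraction every row carries a non-zero value in that column.
Steps 5 and 6 then operate on that much larger matrix.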

