mahout-user mailing list archives

From Ted Dunning <>
Subject Re: Question Regarding Distributed Row Matrix
Date Thu, 05 May 2011 18:38:05 GMT

You say you are doing SVD on image data.  Why are you worrying about Mahout?

I just did a quick test in R: on my laptop, random projection takes about
5 seconds to extract the first 50 singular values and the corresponding
singular vectors of a 10,000 x 10,000 random dense matrix. With sufficient
memory to store the original matrix roughly twice, you should be able to
get very fast results on any reasonably sized image. Even if you have to
read the matrix from disk, you only need to make a few passes over it to
get the results. Thus, if you have a million rows and 10,000 columns, I
would expect that you would be able to do a full-on SVD in an hour or so.
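
For reference, the random projection approach described above can be sketched
in a few lines of NumPy. This is a minimal illustration and not the R test
mentioned in this message, nor Mahout's implementation; the function name
randomized_svd and the parameters oversample and power_iters are assumptions
made for the example.

```python
import numpy as np


def randomized_svd(a, k, oversample=10, power_iters=2, seed=0):
    """Approximate the top-k SVD of a dense matrix by random projection.

    Hypothetical helper for illustration only. Each product with `a` or
    `a.T` corresponds to one pass over the matrix.
    """
    rng = np.random.default_rng(seed)
    n_cols = a.shape[1]
    sketch_dim = min(n_cols, k + oversample)

    # Project the columns of `a` onto a small random subspace.
    omega = rng.standard_normal((n_cols, sketch_dim))
    q, _ = np.linalg.qr(a @ omega)

    # Optional power iterations sharpen the subspace when the singular
    # values decay slowly; re-orthonormalize each time for stability.
    for _ in range(power_iters):
        z, _ = np.linalg.qr(a.T @ q)
        q, _ = np.linalg.qr(a @ z)

    # Solve the small problem exactly, then map back to the original space.
    b = q.T @ a
    u_small, s, vt = np.linalg.svd(b, full_matrices=False)
    u = q @ u_small
    return u[:, :k], s[:k], vt[:k, :]
```

Calling randomized_svd(a, 50) on a dense array returns the leading 50 left
singular vectors, singular values, and right singular vectors. The working
memory beyond the matrix itself is a handful of tall, thin arrays.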

Because the sequential version is so fast, I would be surprised if you are
able to get significant wins from a distributed version on any dense matrix
I can imagine coming from an image source. Dimitriy's random projection code
should be as good as it gets on this, but with dense data I am not so sure
you will see a big win.

On Thu, May 5, 2011 at 11:05 AM, Vckay <> wrote:

> On Thu, May 5, 2011 at 12:22 PM, Jake Mannix <> wrote:
> > On Thu, May 5, 2011 at 8:24 AM, Vckay <> wrote:
> >
> > > So I am trying to build PCA. I was recommended in a previous thread
> > > that it was better that my data is available at the start as a
> > > distributed row matrix. The work flow (already posted in a previous
> > > thread) would be:
> > > 1. Get the data into distributed row matrix format.
> > > 2. Compute empirical mean vector.
> > >
> >
> > Note that as we've mentioned in other threads, this step:
> >
> >
> >
> I know what you guys were saying in the previous thread. I believe I did
> mention that since I would be working with image data that is overwhelmingly
> dense, meaning that even if I did subtract the mean, I would
> essentially get a sparse matrix. In fact, running SVD separately on the
> matrix and the low rank matrix (e*m') would probably in this case be a bad
> idea because you would end up having to run the code on a dense matrix.
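
For readers following the e*m' discussion quoted above: the products an SVD
solver needs from the centered matrix A - e*m' can be computed without ever
forming it. The sketch below assumes e is the all-ones column vector and m is
the vector of column means; the function names are hypothetical and are not
part of Mahout.

```python
import numpy as np


def centered_matvec(a, mean, v):
    """Compute (A - e m') v as A v - e (m' v) without forming A - e m'."""
    # mean @ v is a scalar; subtracting it broadcasts over every row.
    return a @ v - (mean @ v)


def centered_rmatvec(a, mean, u):
    """Compute (A - e m')' u as A' u - m (e' u) without forming A - e m'."""
    # e' u is simply the sum of the entries of u.
    return a.T @ u - mean * u.sum()
```

Whether this split is worthwhile depends on the data. It preserves sparsity
when A is sparse; when A is already dense it saves nothing over subtracting
the mean directly.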
