Indeed, if a single image is the source, it is way too small a
problem. Hadoop is not going to get you any win, IMO.
On Thu, May 5, 2011 at 11:38 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Vckay,
>
> You say you are doing SVD on image data. Why are you worrying about Mahout?
>
> I just did a quick test using R, and on my laptop random projection takes
> about 5 seconds to extract the
> first 50 singular values and corresponding singular vectors of a 10,000 x 10,000
> random dense matrix.
> With sufficient memory to store the original matrix roughly twice, you
> should be able to get very fast results on
> any reasonable sized image. Even if you have to read the matrix from disk,
> you only need to make a few passes
> over it to get the results. Thus, if you have a million rows and 10,000
> columns I would expect that you would be
> able to do a full SVD in an hour or so.
>
> Because the sequential version is so fast, I would be surprised if you are
> able to get significant
> wins from Mahout on any dense matrix I can imagine coming from an image source.
> Dimitriy's random projection code
> should be as good as it gets on this, but with dense data I am not so sure
> you will see a big win.
>
>
>
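[Editor's sketch] For readers who want to try the sequential approach described above, here is a minimal random-projection SVD in Python with NumPy. It is not the R code used for the timing quoted above, and it is not Dimitriy's Mahout code. The function name randomized_svd and the parameters oversample and power_iters are hypothetical choices made for illustration.

```python
import numpy as np


def randomized_svd(a, k, oversample=10, power_iters=2, seed=0):
    """Approximate the top-k singular triplets of `a` by random projection.

    Assumption: `a` fits in memory as a dense NumPy array. The names and
    default values here are illustrative, not taken from the thread.
    """
    rng = np.random.default_rng(seed)
    n_cols = a.shape[1]
    width = k + oversample

    # One pass over `a`: project its columns onto a random subspace.
    y = a @ rng.standard_normal((n_cols, width))

    # Optional power iterations sharpen the subspace when the singular
    # values decay slowly. Each iteration costs two more passes over `a`.
    for _ in range(power_iters):
        y, _ = np.linalg.qr(y)
        z, _ = np.linalg.qr(a.T @ y)
        y = a @ z

    # Orthonormal basis for the approximate range of `a`.
    q, _ = np.linalg.qr(y)

    # Exact SVD of the small (width x n_cols) matrix, then map back.
    b = q.T @ a
    u_small, s, vt = np.linalg.svd(b, full_matrices=False)
    u = q @ u_small
    return u[:, :k], s[:k], vt[:k, :]


# Small demonstration on a random dense matrix (far smaller than 10,000 x 10,000).
demo = np.random.default_rng(42).standard_normal((500, 300))
u, s, vt = randomized_svd(demo, k=10)
print(u.shape, s.shape, vt.shape)
```

The matrix is only touched through products with a tall, thin block, which is why a few passes over the data are enough even when it has to be read from disk.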
> On Thu, May 5, 2011 at 11:05 AM, Vckay <darkvckay@gmail.com> wrote:
>
>> On Thu, May 5, 2011 at 12:22 PM, Jake Mannix <jake.mannix@gmail.com>
>> wrote:
>>
>> > On Thu, May 5, 2011 at 8:24 AM, Vckay <darkvckay@gmail.com> wrote:
>> >
>> > > So I am trying to build PCA. I was recommended in a previous thread that
>> > > it was better that my data is available at the start as a distributed row
>> > > matrix. The work flow (already posted in a previous thread) would be:
>> > > 1. Get the data into distributed row matrix format.
>> > > 2. Compute empirical mean vector.
>> > >
>> >
>> > Note that as we've mentioned in other threads, this step:
>> >
>> >
>> >
>> I know what you guys were saying in the previous thread. I believe I did
>> mention that since I would be working with image data that is overwhelmingly
>> dense, meaning that even if I did subtract the mean, I would
>> essentially get a sparse matrix. In fact, running SVD separately on the
>> matrix and the low-rank matrix (e*m') would probably in this case be a bad
>> idea because you would end up having to run the code on a dense matrix.
>>
>
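[Editor's sketch] For context on the e*m' term mentioned above: writing A for the data matrix, m for the empirical mean vector, and e for a vector of ones, the centered matrix is A - e*m'. Products with it can be computed without ever forming it, which is what makes the split attractive when A is sparse. The NumPy sketch below illustrates that identity only. The function names are hypothetical, and this is not code from Mahout or from the earlier thread. When A is already dense, as with image data, the split saves nothing over subtracting the mean directly.

```python
import numpy as np


def centered_matvec(a, mean, x):
    """Return (a - e m') x, computed as a x - e (m' x)."""
    # `mean @ x` is a scalar; subtracting it broadcasts over all rows.
    return a @ x - mean @ x


def centered_rmatvec(a, mean, y):
    """Return (a - e m')' y, computed as a' y - m (e' y)."""
    # `y.sum()` is e' y, the sum of the entries of y.
    return a.T @ y - mean * y.sum()


# Usage: the empirical mean vector is the vector of column means.
a = np.random.default_rng(0).standard_normal((8, 5))
mean = a.mean(axis=0)
x = np.ones(5)
print(np.allclose(centered_matvec(a, mean, x), (a - mean) @ x))
```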
