So I am trying to build PCA. In a previous thread it was recommended that my
data be available at the start as a distributed row matrix. The workflow
(already posted in a previous thread) would be:
1. Get the data into DistributedRowMatrix format.
2. Compute the empirical mean vector.
3. Subtract the mean from the data.
4. Compute the transpose.
5. Multiply the matrix by its transpose.
6. Perform SVD on the resulting matrix (Lanczos SVD).
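For reference, steps 2 through 6 are just the standard PCA computation. A minimal local sketch of the same math (using NumPy on a toy in-memory array, not Mahout's distributed classes; the data values are made up for illustration):

```python
import numpy as np

# Toy data: 5 observations (rows) x 3 features (columns).
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0],
              [4.0, 1.0, 1.0],
              [1.0, 3.0, 2.0]])

mean = X.mean(axis=0)   # step 2: empirical mean vector
Xc = X - mean           # step 3: subtract the mean
cov = Xc.T @ Xc         # steps 4-5: transpose and multiply (scatter matrix)

# Step 6: SVD of the (symmetric) scatter matrix; the singular vectors
# are the principal axes, the singular values their squared magnitudes.
U, s, Vt = np.linalg.svd(cov)
```

Note that for mean-centered data this X^T X product is the scatter matrix (divide by n-1 to get the sample covariance); multiplying by the transpose first is what reduces the problem to an SVD of a small d-by-d matrix.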
Now I could avoid step 1 and do steps 2 and 3 without worrying about the
DistributedRowMatrix format. I guess I could do step 4 manually too (even
though, to be frank, I don't understand your second paragraph, i.e. how
turning the vectors into sparse vectors will enable me to do the transpose
more easily without resorting to doing it manually). However, I suppose the
purpose of the DRM format was to make steps 5 and 6 much easier, so I guess I
have to figure out how to get the data into this format to be able to do
those steps.
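Here is a tiny sketch of what I take the sparse-vector point to mean (plain Python dicts standing in as hypothetical sparse vectors; this is my reading of the intent, not Mahout code): since each row stores only (index, value) pairs, row indices need not be contiguous, and a transpose just re-keys every entry by its column index.

```python
# Sparse rows keyed by (possibly non-contiguous) row index; each row maps
# column index -> nonzero value, mimicking a sparse vector.
rows = {
    0:  {2: 1.5},
    7:  {0: 3.0, 2: -1.0},   # row indices 0, 7, 42 need not be contiguous
    42: {1: 2.25},
}

# Transpose: re-key each entry by its column index -- the same idea a
# map-reduce transpose job uses: map emits (col, (row, value)) pairs,
# and the reducer for each col assembles the transposed row.
transposed = {}
for r, row in rows.items():
    for c, v in row.items():
        transposed.setdefault(c, {})[r] = v

print(transposed[2])  # -> {0: 1.5, 7: -1.0}
```

If that reading is right, the number of reducers in the vectorization step indeed doesn't constrain the transpose, since only uniqueness of the indices matters, not their range.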
Thanks again!
On Thu, May 5, 2011 at 9:40 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> I think the first step is to decide on the pipeline of algorithms. Once you
> know the algorithms you want to run through, it will be easier to come up
> with vectorization requirements.
>
> That said, for the sake of transposition, note that Mahout supports sparse
> vectors, i.e. it doesn't matter what an element's index is, as long as it
> is unique; only the number of nonzero elements does. So I don't think that
> you are per se constrained in the number of reducers during vectorization
> for transpose. That would have been pretty scale-restricting, indeed.
>
> apologies for brevity.
>
> Sent from my android.
> Dmitriy
> On May 5, 2011 6:58 AM, "Vckay" <darkvckay@gmail.com> wrote:
>
