mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: PCA using Java Code
Date Wed, 03 Jul 2013 16:24:36 GMT
There's probably some confusion about the options.

(1) --pca=true enables the PCA flow in general. There's more to it than just
taking a mean and re-centering.
(2) --us=true enables computation of U*Sigma, which is what approximates the
dimensionality-reduced output with the original variances. This is what one
usually wants from PCA, although in some cases it may be useful to just use
U.
(3) Optionally, one may supply an externally computed column mean by using
--pcaOffset. The motivation behind this option is that PCA is rarely a
standalone job in a pipeline. Usually there's an MR job that preps the PCA
input, in which case it is very easy to average the rows in the reducers
of that previous step (and do the final averaging in the front end). That
saves one MR pass over the input, because otherwise SSVD needs one additional
MR pass over A to compute the average.
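
To illustrate option (3), here is a minimal sketch of the "partial sums in
the reducers, final averaging in the front end" idea. It is plain Java with
no Mahout or Hadoop dependency; the class and method names (ColumnMeanSketch,
Partial, partialSum, columnMean) are hypothetical and not part of Mahout. How
the resulting vector is then stored for --pcaOffset is not covered here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Mahout.
final class ColumnMeanSketch {

    /** What one reducer could emit: the element-wise sum of its rows and how many rows it saw. */
    static final class Partial {
        final double[] sum;
        final long rows;

        Partial(double[] sum, long rows) {
            this.sum = sum;
            this.rows = rows;
        }
    }

    /** Reducer side: accumulate the rows handled by one reducer into a running sum. */
    static Partial partialSum(List<double[]> rows, int numCols) {
        double[] sum = new double[numCols];
        for (double[] row : rows) {
            for (int j = 0; j < numCols; j++) {
                sum[j] += row[j];
            }
        }
        return new Partial(sum, rows.size());
    }

    /** Front end: merge all partial sums and divide by the total row count. */
    static double[] columnMean(List<Partial> partials, int numCols) {
        double[] mean = new double[numCols];
        long total = 0;
        for (Partial p : partials) {
            for (int j = 0; j < numCols; j++) {
                mean[j] += p.sum[j];
            }
            total += p.rows;
        }
        if (total == 0) {
            throw new IllegalArgumentException("no rows to average");
        }
        for (int j = 0; j < numCols; j++) {
            mean[j] /= total;
        }
        return mean;
    }

    public static void main(String[] args) {
        // Two "reducers": the first saw two rows, the second saw one.
        List<double[]> first = new ArrayList<>();
        first.add(new double[] {1, 2});
        first.add(new double[] {3, 4});
        List<double[]> second = new ArrayList<>();
        second.add(new double[] {5, 6});

        List<Partial> partials = new ArrayList<>();
        partials.add(partialSum(first, 2));
        partials.add(partialSum(second, 2));

        System.out.println(Arrays.toString(columnMean(partials, 2))); // prints [3.0, 4.0]
    }
}
```

Carrying the row count alongside each partial sum is what makes the final
average exact even when the reducers see different numbers of rows.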

Bottom line, typically one wants something along the lines of:

ssvd --pca=true -u=false -v=false -us=true ...
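
As background on why U*Sigma is the usual PCA output, here is a brief sketch
in standard SVD notation. The symbols are assumptions for illustration and do
not come from the Mahout code: A is the m x n input, \xi is the column mean,
and U, \Sigma, V are the rank-k factors.

```latex
A - \mathbf{1}\,\xi^{\top} \approx U \Sigma V^{\top}
\quad\Longrightarrow\quad
U \Sigma \approx \left(A - \mathbf{1}\,\xi^{\top}\right) V
```

The implication uses the fact that V has orthonormal columns. Each row of
U*Sigma is therefore the corresponding mean-centered input row expressed in
the basis of the top k principal directions, which is why it keeps the
original variances, whereas the columns of U alone are scaled to unit norm.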




On Wed, Jul 3, 2013 at 8:58 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

>
> On Jul 3, 2013 6:56 AM, "Chirag Lakhani" <clakhani@zaloni.com> wrote:
> >
> > So how does the column mean get calculated if the --pcaOffset option is not
> > specified?
> By taking the average of all row vectors. See the code for details.
>
> > I would think you are just doing SVD at that point.
> This statement is incorrect. I know because I designed this code.
>
> >
> >
> > On Tue, Jul 2, 2013 at 5:52 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> >
> > > On Tue, Jul 2, 2013 at 1:52 PM, Chirag Lakhani <clakhani@zaloni.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > I am trying to use the Mahout/Java API to do PCA but I am confused about
> > > > the right order to do things.  To start, I have a list of DenseVectors
> > > > that I am reading into the code and turning into a distributed matrix in
> > > > the following form.
> > > >
> > > >  DistributedRowMatrix m = new DistributedRowMatrix(input_vec, matrix_path,
> > > >      num_rows, num_cols);
> > > >
> > > > When I run this code, I would have thought it would output the result
> > > > into the path called "matrix_path" so that I can then use something like
> > > > MatrixColumnMeansJob.run to get the mean. When I run this bit of code I
> > > > get no output. Is there something else I should do, or is there a better
> > > > way to calculate the mean for my file?
> > > >
> > > >
> > > > From what I understand about the SSVD CLI code, you need to calculate the
> > > > column mean and then output it into a directory.
> > >
> > >
> > >
> > > No, you don't have to (although you have an _option_ to calculate and
> > > substitute one yourself if for some reason it is already known). The
> > > default usage calculates it for you.
> > >
> > >
> > >
> > > > Is there a good way to do this if I am starting from a file which is a
> > > > sequence file of DenseVectors?
> > > >
> > >
> > > Yes. Just don't specify the --pcaOffset option.
> > >
> > >
> > > >
> > > > --
> > > >
> > > > *Chirag Lakhani*
> > > >
> > > > Data Scientist
> > > >
> > > > Zaloni, Inc. | www.zaloni.com
> > > >
> > > > 633 Davis Dr., Suite 200
> > > >
> > > > Durham, NC 27713
> > > > e: clakhani@zaloni.com
> > > > p: 919.602.4965 x7020
> > > >
> > >
> >
> >
> >
>
>
