mahout-user mailing list archives

From Chirag Lakhani <clakh...@zaloni.com>
Subject Re: PCA using Java Code
Date Wed, 03 Jul 2013 19:25:57 GMT
Okay, thanks for that.  After working on that issue I am still having
trouble running the SSVD solver.  I know I have asked this before, but I
still cannot initiate the SSVD solver when the input, called inputFolder,
is the location of the sequence files of DenseVectors.  Is there something
I am missing with this code?


String inputFolder = "/data_csv_for_pca/";
String pcaOutput = "/vectors/";
String column_type = "DenseVector";
Path input_vec = new Path(inputFolder);

SSVDSolver solver = new SSVDSolver(conf, new Path[] {input_vec},
        new Path(pcaOutput), 18, 5, 3, 10);


On Wed, Jul 3, 2013 at 12:24 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> There's probably confusion about options.
>
> (1) --pca=true enables the PCA flow in general. There's more to it than
> just taking a mean and re-centering.
> (2) --us=true enables computation of the U*Sigma output, which approximates
> the dimensionality-reduced output with the original variances. This is what
> one usually wants from PCA, although in some cases it may be useful to just
> use U.
> (3) optionally, one may supply an externally computed column mean by using
> --pcaOffset. The motivation behind this option is that PCA is usually not a
> standalone job in a pipeline. Usually there's an MR job that preps the PCA
> input, in which case it is very easy to take row averages in the reducers
> of the previous step (and do the final averaging in the front end). That
> saves one MR pass over the input, because computing the average inside SSVD
> requires one additional MR pass over A.
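
A minimal Java sketch of the idea in option (3): each reducer of the prep job keeps a per-column running sum and a row count, and the front end merges those partials into the column mean. The class `PartialColumnMean` and its methods are hypothetical names made up for illustration, not part of Mahout, and writing the resulting vector to the location given to --pcaOffset is left out.

```java
import java.util.List;

/** Hypothetical helper for illustration only; not a Mahout class. */
final class PartialColumnMean {
    final double[] sum;  // per-column running sum of the rows seen so far
    long count;          // number of rows seen so far

    PartialColumnMean(int numCols) {
        this.sum = new double[numCols];
    }

    /** What one reducer would do for each row it emits. */
    void addRow(double[] row) {
        for (int j = 0; j < sum.length; j++) {
            sum[j] += row[j];
        }
        count++;
    }

    /** Front-end step: merge all partials and divide by the total row count. */
    static double[] combine(List<PartialColumnMean> partials, int numCols) {
        double[] total = new double[numCols];
        long rows = 0;
        for (PartialColumnMean p : partials) {
            for (int j = 0; j < numCols; j++) {
                total[j] += p.sum[j];
            }
            rows += p.count;
        }
        if (rows == 0) {
            throw new IllegalArgumentException("no rows were seen");
        }
        for (int j = 0; j < numCols; j++) {
            total[j] /= rows;
        }
        return total;
    }
}
```

Merging sums and counts, rather than averaging per-reducer averages, keeps the result correct when reducers see different numbers of rows.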
>
> Bottom line, typically one wants something along the lines of:
>
> ssvd --pca=true -u=false -v=false -us=true ...
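
For readers mapping these flags to the algebra, here is a short sketch in notation chosen for this note (not taken from the Mahout code): A is the m x n input with rows a_i, xi is the column mean, and k is the requested rank.

```latex
\xi = \frac{1}{m}\sum_{i=1}^{m} a_i, \qquad
\tilde{A} = A - \mathbf{1}\,\xi^{\top} \approx U_k \Sigma_k V_k^{\top}, \qquad
U_k \Sigma_k \approx \tilde{A}\, V_k .
```

Under this reading, each row of U_k Sigma_k is the k-dimensional representation of the corresponding mean-centered input row, which is why the U*Sigma output is usually the one wanted from PCA.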
>
>
>
>
> On Wed, Jul 3, 2013 at 8:58 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> wrote:
>
> >
> > On Jul 3, 2013 6:56 AM, "Chirag Lakhani" <clakhani@zaloni.com> wrote:
> > >
> > > So how does the column mean get calculated if the --pcaOffset option
> > > is not
> > By taking the average of all row vectors. See code for details.
> >
> > > specified?  I would think you are just doing SVD at that point.
> > This statement is incorrect. I know because I designed this code.
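
To make the distinction concrete, the following plain-Java sketch shows the centering that separates PCA from a plain SVD of the raw input: the column mean is the average of the row vectors, and it is subtracted from every row before decomposition. `MeanCentering` is a hypothetical class written for this note. It works on small in-memory arrays and is not how the distributed SSVD code handles the mean internally.

```java
/** Hypothetical in-memory illustration; not a Mahout class. */
final class MeanCentering {

    /** Column mean computed as the average of all row vectors. */
    static double[] columnMean(double[][] rows) {
        if (rows.length == 0) {
            throw new IllegalArgumentException("need at least one row");
        }
        double[] mean = new double[rows[0].length];
        for (double[] row : rows) {
            for (int j = 0; j < mean.length; j++) {
                mean[j] += row[j];
            }
        }
        for (int j = 0; j < mean.length; j++) {
            mean[j] /= rows.length;
        }
        return mean;
    }

    /** Returns a new matrix with the given mean subtracted from every row. */
    static double[][] center(double[][] rows, double[] mean) {
        double[][] centered = new double[rows.length][mean.length];
        for (int i = 0; i < rows.length; i++) {
            for (int j = 0; j < mean.length; j++) {
                centered[i][j] = rows[i][j] - mean[j];
            }
        }
        return centered;
    }
}
```

A decomposition of the centered matrix yields principal components, while a decomposition of the uncentered rows is just an SVD of the raw data.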
> >
> > >
> > >
> > > On Tue, Jul 2, 2013 at 5:52 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > > wrote:
> > >
> > > > On Tue, Jul 2, 2013 at 1:52 PM, Chirag Lakhani <clakhani@zaloni.com>
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I am trying to use the Mahout/Java API to do PCA but I am confused
> > > > > about the right order to do things.  To start, I have a list of
> > > > > DenseVectors that I am reading into the code and turning into a
> > > > > distributed matrix in the following form.
> > > > >
> > > > >  DistributedRowMatrix m = new DistributedRowMatrix(input_vec,
> > > > >      matrix_path, num_rows, num_cols);
> > > > >
> > > > > When I run this code, I would have thought it would output the
> > > > > result into the path called "matrix_path" so that I can then use
> > > > > something like MatrixColumnMeansJob.run to get the mean. When I run
> > > > > this bit of code I get no output. Is there something else I should
> > > > > do, or is there a better way to calculate the mean for my file?
> > > > >
> > > > >
> > > > > From what I understand about the SSVD CLI code, you need to
> > > > > calculate the column mean and then output it into a directory.
> > > >
> > > >
> > > > No, you don't have to (although you have an _option_ to calculate and
> > > > substitute one yourself if for some reason it is already known). The
> > > > default use assumes it would calculate it for you.
> > > >
> > > >
> > > >
> > > > > Is there a good way to do this if I am starting from a file which
> > > > > is a sequence file of DenseVectors?
> > > > >
> > > >
> > > > Yes. Just don't specify the --pcaOffset option.
> > > >
> > > >
> > > > >
> > > > > --
> > > > >
> > > > > *Chirag Lakhani*
> > > > >
> > > > > Data Scientist
> > > > >
> > > > > Zaloni, Inc. | www.zaloni.com
> > > > >
> > > > > 633 Davis Dr., Suite 200
> > > > >
> > > > > Durham, NC 27713
> > > > > e: clakhani@zaloni.com
> > > > > p: 919.602.4965 x7020
> > > > >
> > > >
> > >
> > >
> > >
> >
> >
>



-- 

*Chirag Lakhani*

Data Scientist

Zaloni, Inc. | www.zaloni.com

633 Davis Dr., Suite 200

Durham, NC 27713
e: clakhani@zaloni.com
p: 919.602.4965 x7020
