mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: PCA using Java Code
Date Wed, 03 Jul 2013 20:19:59 GMT
On Wed, Jul 3, 2013 at 12:25 PM, Chirag Lakhani <clakhani@zaloni.com> wrote:

> Okay, thanks for that.  After working on that issue I am still having
> trouble running the SSVD solver.  I know I have asked this before, but I
> still cannot start the SSVD solver when the input, inputFolder, is the
> location of the sequence files of DenseVectors.  Is there something I
> am missing in this code?
>
>
> String inputFolder = "/data_csv_for_pca/";
> String pcaOutput = "/vectors/";
> String column_type = "DenseVector";
> Path input_vec = new Path(inputFolder);
>
> SSVDSolver solver = new SSVDSolver(conf, new Path[] {input_vec},
>     new Path(pcaOutput), 18, 5, 3, 10);
>


SSVDSolver does not encapsulate the entire PCA workflow on its own.

You can use SSVDCli as an example for building the entire thing to embed.
The SSVDSolver class does not compute the PCA offset on its own; SSVDCli uses
another job from Distributed Matrix to compute that (again, see the SSVDCli
code).

As for the problems with not finding the input, there could be any number of
reasons in your case. Try using absolute hdfs://-prefixed paths for all files.
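
To make the path advice concrete, here is a minimal plain-Java sketch (not Mahout or Hadoop code) that checks whether a location is an absolute hdfs:// URI and, if not, prefixes it with a namenode authority. The class name, method names, and the namenode address are hypothetical and only for illustration; in a real job the Hadoop FileSystem API can qualify paths for you.

```java
import java.net.URI;

final class HdfsPathCheck {

    /** True if the location has the "hdfs" scheme and an absolute path. */
    static boolean isAbsoluteHdfsUri(String location) {
        URI uri = URI.create(location);
        return "hdfs".equalsIgnoreCase(uri.getScheme())
                && uri.getPath() != null
                && uri.getPath().startsWith("/");
    }

    /** Prefixes a scheme-less absolute path with hdfs://<namenode>. */
    static String qualify(String namenode, String path) {
        if (isAbsoluteHdfsUri(path)) {
            return path; // already fully qualified
        }
        if (!path.startsWith("/")) {
            throw new IllegalArgumentException("Not an absolute path: " + path);
        }
        return "hdfs://" + namenode + path;
    }

    public static void main(String[] args) {
        // Hypothetical namenode address; substitute the one for your cluster.
        String namenode = "namenode.example.com:8020";
        System.out.println(qualify(namenode, "/data_csv_for_pca/"));
        System.out.println(qualify(namenode, "/vectors/"));
    }
}
```

The qualified strings can then be passed to new Path(...) in place of the bare "/data_csv_for_pca/" and "/vectors/" values, which removes any dependence on the default file system configured in conf.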


>
>
> On Wed, Jul 3, 2013 at 12:24 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> wrote:
>
> > There's probably confusion about the options.
> >
> > (1) --pca=true enables the PCA flow in general. There's more to it than
> > just taking a mean and re-centering.
> > (2) --us=true enables computation of U*Sigma, which approximates the
> > dimensionality-reduced output with the original variances. This is what
> > one usually wants from PCA, although in some cases it may be useful to
> > use just U.
> > (3) Optionally, one may supply an externally computed column mean by
> > using --pcaOffset. The motivation behind this option is that PCA is
> > rarely a standalone job in a pipeline. Usually there's an MR job that
> > preps the PCA input, in which case it is very easy to take row averages
> > in the reducers of the previous step (and do the final averaging in the
> > front end). That saves one MR pass over the input, because computing
> > the average inside SSVD requires one additional MR pass over A.
> >
> > Bottom line, typically one wants something along the lines
> >
> > ssvd --pca=true -u=false -v=false -us=true ...
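
The idea in option (3), taking averages in the reducers of a preceding step and finishing the averaging in the front end, can be sketched in plain Java as follows. This is an illustration under stated assumptions, not Mahout code: the class and method names are hypothetical, and each "reducer" is simulated by an in-memory array. The sketch carries per-reducer column sums plus a row count rather than per-reducer averages, so that reducers with different numbers of rows are weighted correctly.

```java
import java.util.Arrays;
import java.util.List;

final class PartialColumnMean {

    /** What one reducer would emit: column-wise sums and its row count. */
    static final class Partial {
        final double[] sum;
        final long rows;

        Partial(double[] sum, long rows) {
            this.sum = sum;
            this.rows = rows;
        }
    }

    /** Simulates the work of a single reducer over its share of the rows. */
    static Partial summarize(double[][] rows, int numCols) {
        double[] sum = new double[numCols];
        for (double[] row : rows) {
            for (int j = 0; j < numCols; j++) {
                sum[j] += row[j];
            }
        }
        return new Partial(sum, rows.length);
    }

    /** Front-end step: merges all partials into the overall column mean. */
    static double[] combine(List<Partial> partials, int numCols) {
        double[] mean = new double[numCols];
        long n = 0;
        for (Partial p : partials) {
            for (int j = 0; j < numCols; j++) {
                mean[j] += p.sum[j];
            }
            n += p.rows;
        }
        if (n == 0) {
            throw new IllegalArgumentException("No rows to average");
        }
        for (int j = 0; j < numCols; j++) {
            mean[j] /= n;
        }
        return mean;
    }

    public static void main(String[] args) {
        Partial r1 = summarize(new double[][] {{1, 2}, {3, 4}}, 2);
        Partial r2 = summarize(new double[][] {{5, 6}}, 2);
        // prints [3.0, 4.0]
        System.out.println(Arrays.toString(combine(Arrays.asList(r1, r2), 2)));
    }
}
```

The resulting vector is what would be written out and handed to the solver via --pcaOffset, which is how the extra pass over A is avoided.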
> >
> >
> >
> >
> > On Wed, Jul 3, 2013 at 8:58 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > wrote:
> >
> > >
> > > On Jul 3, 2013 6:56 AM, "Chirag Lakhani" <clakhani@zaloni.com> wrote:
> > > >
> > > > So how does the column mean get calculated if the --pcaOffset option
> > > > is not specified?
> > > By taking the average of all row vectors. See the code for details.
> > >
> > > > I would think you are just doing SVD at that point.
> > > This statement is incorrect. I know because I designed this code.
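
Why averaging the row vectors matters can be seen in a small plain-Java sketch. It is only an illustration and not how SSVD is implemented: the names are hypothetical, the data is made up, and the leading direction is found with simple power iteration on A^T A rather than with a stochastic solver. The sketch computes the column mean as the average of all rows, then compares the leading direction of the raw matrix (plain SVD) with that of the mean-centered matrix (PCA).

```java
final class CenteringDemo {

    /** Column mean: the average of all row vectors. */
    static double[] columnMean(double[][] a) {
        double[] mean = new double[a[0].length];
        for (double[] row : a) {
            for (int j = 0; j < mean.length; j++) {
                mean[j] += row[j];
            }
        }
        for (int j = 0; j < mean.length; j++) {
            mean[j] /= a.length;
        }
        return mean;
    }

    /** Returns a copy of a with the mean subtracted from every row. */
    static double[][] center(double[][] a, double[] mean) {
        double[][] c = new double[a.length][mean.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < mean.length; j++) {
                c[i][j] = a[i][j] - mean[j];
            }
        }
        return c;
    }

    /**
     * Leading right singular vector of a, via power iteration on A^T A.
     * Starts from the first unit vector, which is assumed not to be
     * orthogonal to the answer (true for the data used here).
     */
    static double[] topDirection(double[][] a, int iterations) {
        int n = a[0].length;
        double[] v = new double[n];
        v[0] = 1.0;
        for (int it = 0; it < iterations; it++) {
            double[] w = new double[n]; // w = A^T (A v)
            for (double[] row : a) {
                double dot = 0;
                for (int j = 0; j < n; j++) {
                    dot += row[j] * v[j];
                }
                for (int j = 0; j < n; j++) {
                    w[j] += dot * row[j];
                }
            }
            double norm = 0;
            for (double x : w) {
                norm += x * x;
            }
            norm = Math.sqrt(norm);
            if (norm == 0) {
                break; // a is all zeros; keep the current vector
            }
            for (int j = 0; j < n; j++) {
                v[j] = w[j] / norm;
            }
        }
        return v;
    }

    public static void main(String[] args) {
        double[][] a = {{4, 6}, {5, 5}, {6, 4}};
        double[] svd = topDirection(a, 50);
        double[] pca = topDirection(center(a, columnMean(a)), 50);
        System.out.printf("uncentered: (%.3f, %.3f)%n", svd[0], svd[1]);
        System.out.printf("centered:   (%.3f, %.3f)%n", pca[0], pca[1]);
    }
}
```

For this data the uncentered direction is roughly (0.707, 0.707), pointing toward the mean (5, 5), while the centered direction is roughly (0.707, -0.707), along which the points actually vary. The two are perpendicular, which shows how different the results with and without the offset can be.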
> > >
> > > >
> > > >
> > > > On Tue, Jul 2, 2013 at 5:52 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > > > wrote:
> > > >
> > > > > On Tue, Jul 2, 2013 at 1:52 PM, Chirag Lakhani
> > > > > <clakhani@zaloni.com> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > > I am trying to use the Mahout/Java API to do PCA, but I am
> > > > > > > confused about the right order in which to do things.  To
> > > > > > > start, I have a list of DenseVectors that I am reading into
> > > > > > > the code and turning into a distributed matrix in the
> > > > > > > following form:
> > > > > > >
> > > > > > >  DistributedRowMatrix m = new DistributedRowMatrix(input_vec,
> > > > > > >      matrix_path, num_rows, num_cols);
> > > > > > >
> > > > > > > When I run this code, I would have thought it would write the
> > > > > > > result to the path called "matrix_path" so that I can then use
> > > > > > > something like MatrixColumnMeansJob.run to get the mean.  When
> > > > > > > I run this bit of code I get no output.  Is there something
> > > > > > > else I should do, or is there a better way to calculate the
> > > > > > > mean for my file?
> > > > > >
> > > > > >
> > > > > > > From what I understand about the SSVDCli code, you need to
> > > > > > > calculate the column mean and then output it into a directory.
> > > > > >
> > > > >
> > > > >
> > > > > > No, you don't have to (although you have the _option_ to
> > > > > > calculate and substitute one yourself if for some reason it is
> > > > > > already known). By default, it is calculated for you.
> > > > >
> > > > >
> > > > >
> > > > > > > Is there a good way to do this if I am starting from a file
> > > > > > > which is a sequence file of DenseVectors?
> > > > > >
> > > > >
> > > > > > Yes. Just don't specify the --pcaOffset option.
> > > > >
> > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > > *Chirag Lakhani*
> > > > > >
> > > > > > Data Scientist
> > > > > >
> > > > > > Zaloni, Inc. | www.zaloni.com
> > > > > >
> > > > > > 633 Davis Dr., Suite 200
> > > > > >
> > > > > > Durham, NC 27713
> > > > > > e: clakhani@zaloni.com
> > > > > > p: 919.602.4965 x7020
> > > > > >
> > > > >
> > >
> > >
> >
>
>
>
> --
>
> *Chirag Lakhani*
>
> Data Scientist
>
> Zaloni, Inc. | www.zaloni.com
>
> 633 Davis Dr., Suite 200
>
> Durham, NC 27713
> e: clakhani@zaloni.com
> p: 919.602.4965 x7020
>
