mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: PCA using Java Code
Date Wed, 03 Jul 2013 20:39:14 GMT
Yeah. Specifically, this code computes the mean (it is called "xi" to
conform to the notation used in the math solution for MAHOUT-817):

    // MAHOUT-817
    if (pca && xiPath == null) {
      xiPath = new Path(tempPath, "xi");
      if (overwrite) {
        fs.delete(xiPath, true);
      }
   ====>   MatrixColumnMeansJob.run(conf, inputPaths[0], xiPath);
    }

... and then passing it all to the SSVD solver:

    SSVDSolver solver =
      new SSVDSolver(conf,
                     inputPaths,
                     new Path(tempPath, "ssvd"),
                     r,
                     k,
                     p,
                     reduceTasks);

    solver.setMinSplitSize(minSplitSize);
    solver.setComputeU(computeU);
    solver.setComputeV(computeV);
    solver.setcUHalfSigma(cUHalfSigma);
    solver.setcVHalfSigma(cVHalfSigma);
    solver.setcUSigma(cUSigma);
    solver.setOuterBlockHeight(h);
    solver.setAbtBlockHeight(abh);
    solver.setQ(q);
    solver.setBroadcast(broadcast);
    solver.setOverwrite(overwrite);


    if (xiPath != null) {
====>      solver.setPcaMeanPath(new Path(xiPath, "part-*"));
    }



The essential pieces are marked with double arrows (====>).
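
To make the role of xi concrete, here is a minimal plain-Java sketch (not Mahout code) of what a column mean is: the average of all row vectors, plus the naive re-centering it enables. The class and method names are hypothetical, and it works on a small in-memory array. The real flow uses the distributed MatrixColumnMeansJob and, as noted elsewhere in this thread, involves more than just taking a mean and re-centering.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Mahout: in-memory illustration only.
final class ColumnMeanSketch {

    /** Returns xi, the column mean: the average of all row vectors. */
    static double[] columnMeans(double[][] rows) {
        if (rows.length == 0) {
            throw new IllegalArgumentException("need at least one row");
        }
        double[] xi = new double[rows[0].length];
        for (double[] row : rows) {
            for (int j = 0; j < xi.length; j++) {
                xi[j] += row[j];
            }
        }
        for (int j = 0; j < xi.length; j++) {
            xi[j] /= rows.length;
        }
        return xi;
    }

    /** Returns a new matrix with xi subtracted from every row. */
    static double[][] center(double[][] rows, double[] xi) {
        double[][] centered = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            centered[i] = new double[xi.length];
            for (int j = 0; j < xi.length; j++) {
                centered[i][j] = rows[i][j] - xi[j];
            }
        }
        return centered;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 6}};
        double[] xi = columnMeans(a);
        System.out.println(Arrays.toString(xi));                 // [2.0, 4.0]
        System.out.println(Arrays.deepToString(center(a, xi)));  // [[-1.0, -2.0], [1.0, 2.0]]
    }
}
```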


On Wed, Jul 3, 2013 at 1:34 PM, Chirag Lakhani <clakhani@zaloni.com> wrote:

> okay thanks.  It looks like I have that part running so I will go back to
> the SSVDCli to finish the rest.  Thanks for your help.
>
> Chirag
>
>
> On Wed, Jul 3, 2013 at 4:19 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> wrote:
>
> > On Wed, Jul 3, 2013 at 12:25 PM, Chirag Lakhani <clakhani@zaloni.com>
> > wrote:
> >
> > > Okay thanks for that.  After working on that issue I am still having
> > > trouble running the SSVD solver.  I know I have asked this before but I
> > > still can not initiate the SSVD solver when the input called inputFolder
> > > is the location of the sequence files of DenseVectors.  Is there
> > > something I am missing with this code?
> > >
> > >
> > > String inputFolder = "/data_csv_for_pca/";
> > > String pcaOutput = "/vectors/";
> > > String column_type = "DenseVector";
> > > Path input_vec = new Path(inputFolder);
> > >
> > > SSVDSolver solver = new SSVDSolver(conf, new Path[] {input_vec},
> > >     new Path(pcaOutput), 18, 5, 3, 10);
> > >
> >
> >
> > SSVDSolver does not encapsulate the entire PCA workflow on its own.
> >
> > You can use SSVDCli as an example to build the entire thing to embed.
> > The SSVDSolver class does not compute the PCA offset on its own; SSVDCli
> > uses another job from Distributed Matrix to compute that (again, see
> > SSVDCli code).
> >
> > Problems with not finding input -- about 1 million reasons in your case.
> > Try to use absolute hdfs:// -prefixed paths for all files.
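
One way to read the advice about absolute paths, sketched in plain Java with java.net.URI so it runs without a cluster. The namenode address is a made-up placeholder and the class name is hypothetical. In Hadoop itself, FileSystem.makeQualified(Path) serves a similar purpose.

```java
import java.net.URI;

// Hypothetical helper: turns a bare path into a fully qualified hdfs:// URI.
final class HdfsPathSketch {

    /**
     * Resolves path against the file system root URI.
     * A path that already carries a scheme is returned unchanged.
     */
    static String qualify(String defaultFs, String path) {
        return URI.create(defaultFs).resolve(path).toString();
    }

    public static void main(String[] args) {
        // Placeholder address; substitute the cluster's actual namenode.
        String fs = "hdfs://namenode.example:8020/";
        System.out.println(qualify(fs, "/data_csv_for_pca/"));
        System.out.println(qualify(fs, "/vectors/"));
    }
}
```

The qualified strings can then be passed to new Path(...) wherever the original code used the bare "/data_csv_for_pca/" and "/vectors/" strings.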
> >
> >
> > >
> > >
> > > On Wed, Jul 3, 2013 at 12:24 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > > wrote:
> > >
> > > > There's probably confusion about options.
> > > >
> > > > (1) --pca=true enables the PCA flow in general. There's more to it
> > > > than just taking a mean and re-centering.
> > > > (2) --us=true enables computation of the U*Sigma output, which is
> > > > what approximates the dimensionality-reduced output with the original
> > > > variances. This is what one usually wants from PCA, although in some
> > > > cases it may be useful to just use U.
> > > > (3) Optionally, one may supply an externally computed column mean by
> > > > using --pcaOffset. The motivation behind this option is that PCA is
> > > > usually never a standalone job in a pipeline. Usually there's an MR
> > > > job that preps the PCA input, in which case it is very easy to take
> > > > row averages in the reducers of the previous step (and do the final
> > > > averaging in the front end). That saves one MR pass over the input,
> > > > because in SSVD the average would otherwise require one additional
> > > > MR pass over A.
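
The saving described in (3) can be sketched in plain Java, assuming each reducer of the upstream job emits a partial column sum and a row count. The Partial class and the method names are hypothetical, not Mahout API. The front end then adds the partials and divides once to obtain the column mean that would be supplied through --pcaOffset.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: combining per-reducer partial sums into a column mean.
final class PartialMeanSketch {

    /** What one reducer would emit: column sums and the number of rows seen. */
    static final class Partial {
        final double[] sum;
        final long count;

        Partial(double[] sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    /** Reducer side: accumulate the rows this reducer happens to see. */
    static Partial accumulate(double[][] rows, int numCols) {
        double[] sum = new double[numCols];
        for (double[] row : rows) {
            for (int j = 0; j < numCols; j++) {
                sum[j] += row[j];
            }
        }
        return new Partial(sum, rows.length);
    }

    /** Front end: add all partial sums, then divide by the total row count. */
    static double[] combine(List<Partial> partials, int numCols) {
        double[] mean = new double[numCols];
        long total = 0;
        for (Partial p : partials) {
            for (int j = 0; j < numCols; j++) {
                mean[j] += p.sum[j];
            }
            total += p.count;
        }
        if (total == 0) {
            throw new IllegalArgumentException("no rows seen");
        }
        for (int j = 0; j < numCols; j++) {
            mean[j] /= total;
        }
        return mean;
    }

    public static void main(String[] args) {
        Partial r1 = accumulate(new double[][] {{1, 2}, {3, 6}}, 2);
        Partial r2 = accumulate(new double[][] {{5, 10}}, 2);
        System.out.println(Arrays.toString(combine(Arrays.asList(r1, r2), 2))); // [3.0, 6.0]
    }
}
```

Dividing only once, after all sums are combined, keeps the result correct when the reducers see different numbers of rows.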
> > > >
> > > > Bottom line, typically one wants something along the lines
> > > >
> > > > ssvd --pca=true -u=false -v=false -us=true ...
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Jul 3, 2013 at 8:58 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > > > wrote:
> > > >
> > > > >
> > > > > On Jul 3, 2013 6:56 AM, "Chirag Lakhani" <clakhani@zaloni.com> wrote:
> > > > > >
> > > > > > So how does the column mean get calculated if the --pcaOffset
> > > > > > option is not specified?
> > > > > By taking the average of all row vectors. See the code for details.
> > > > >
> > > > > > I would think you are just doing SVD at that point.
> > > > > This statement is incorrect. I know because I designed this code.
> > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Jul 2, 2013 at 5:52 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> > > > > >
> > > > > > > On Tue, Jul 2, 2013 at 1:52 PM, Chirag Lakhani <clakhani@zaloni.com> wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I am trying to use the Mahout/Java API to do PCA but I am
> > > > > > > > confused about the right order to do things.  To start, I
> > > > > > > > have a list of DenseVectors that I am reading into the code
> > > > > > > > and turning into a distributed matrix in the following form.
> > > > > > > >
> > > > > > > >  DistributedRowMatrix m = new DistributedRowMatrix(input_vec,
> > > > > > > >      matrix_path, num_rows, num_cols);
> > > > > > > >
> > > > > > > > When I run this code, I would have thought it would output
> > > > > > > > the result into the path called "matrix_path" so that I can
> > > > > > > > then use something like MatrixColumnMeansJob.run to get the
> > > > > > > > mean.  When I run this bit of code I get no output.  Is there
> > > > > > > > something else I should do, or is there a better way to
> > > > > > > > calculate the mean for my file?
> > > > > > > >
> > > > > > > >
> > > > > > > > From what I understand about the SSVD CLI code, you need to
> > > > > > > > calculate the column mean and then output it into a directory.
> > > > > > >
> > > > > > >
> > > > > > > No, you don't have to (although you have an _option_ to
> > > > > > > calculate and substitute one yourself if for some reason it is
> > > > > > > already known). By default, it is calculated for you.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Is there a good way to do this if I am starting from a file
> > > > > > > > which is a sequence file of DenseVectors?
> > > > > > > >
> > > > > > >
> > > > > > > Yes. Just don't specify the --pcaOffset option.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > >
> > > > > > > > *Chirag Lakhani*
> > > > > > > >
> > > > > > > > Data Scientist
> > > > > > > >
> > > > > > > > Zaloni, Inc. | www.zaloni.com
> > > > > > > >
> > > > > > > > 633 Davis Dr., Suite 200
> > > > > > > >
> > > > > > > > Durham, NC 27713
> > > > > > > > e: clakhani@zaloni.com
> > > > > > > > p: 919.602.4965 x7020
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > >
> >
>
>
>
>
