mahout-user mailing list archives

From Chirag Lakhani <clakh...@zaloni.com>
Subject Re: PCA using Java Code
Date Wed, 03 Jul 2013 19:28:27 GMT
I get this error

Exception in thread "main" java.io.IOException: Unable to open input files to determine input label type.
        at org.apache.mahout.math.hadoop.stochasticsvd.SSVDHelper.sniffInputLabelType(SSVDHelper.java:143)
        at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:328)
        at pca_factory.main(pca_factory.java:97)



On Wed, Jul 3, 2013 at 3:25 PM, Chirag Lakhani <clakhani@zaloni.com> wrote:

> Okay, thanks for that.  After working on that issue I am still having
> trouble running the SSVD solver.  I know I have asked this before, but I
> still cannot start the SSVD solver when the input, inputFolder, is the
> location of the sequence files of DenseVectors.  Is there something I am
> missing in this code?
>
>
> String inputFolder = "/data_csv_for_pca/";
> String pcaOutput = "/vectors/";
> String column_type = "DenseVector";
> Path input_vec = new Path(inputFolder);
>
> SSVDSolver solver = new SSVDSolver(conf, new Path[] {input_vec},
>     new Path(pcaOutput), 18, 5, 3, 10);
>
>
> On Wed, Jul 3, 2013 at 12:24 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
>> There's probably confusion about options.
>>
>> (1) --pca=true enables the PCA flow in general. There's more to it than just
>> taking a mean and re-centering.
>> (2) --us=true enables computation of the U*Sigma flow, which approximates
>> the dimensionality-reduced output with the original variances. This is what
>> one usually wants from PCA, although in some cases it may be useful to just
>> use U.
>> (3) Optionally, one may supply an externally computed colmean by using
>> --pcaOffset. The motivation behind this option is that PCA is rarely a
>> standalone job in a pipeline. Usually there's an MR job that preps the PCA
>> input, in which case it is very easy to take row averages in the reducers
>> of the previous step (and do the final averaging in the front end). That
>> saves one MR pass over the input, because otherwise computing the average
>> in SSVD requires one additional MR pass over A.
>>
>> Bottom line, typically one wants something along the lines of
>>
>> ssvd --pca=true -u=false -v=false -us=true ...
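
As an illustration of the idea behind option (3), here is a minimal plain-Java sketch of how a column mean could be assembled from per-reducer partial sums, with the final averaging done once in the front end. It does not use the Mahout or Hadoop APIs and is not how SSVD is implemented; the class PartialColumnSum and its methods are hypothetical names made up for this example.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: accumulates the row vectors seen by one reducer. */
public class PartialColumnSum {
    final double[] sum; // element-wise sum of the rows seen so far
    long rows;          // number of rows seen so far

    public PartialColumnSum(int numCols) {
        this.sum = new double[numCols];
    }

    /** Adds one row, as a reducer could do for each vector it emits. */
    public void add(double[] row) {
        for (int j = 0; j < sum.length; j++) {
            sum[j] += row[j];
        }
        rows++;
    }

    /** Front-end step: merge all partial sums and divide by the total row count. */
    public static double[] columnMean(List<PartialColumnSum> partials) {
        int n = partials.get(0).sum.length;
        double[] total = new double[n];
        long count = 0;
        for (PartialColumnSum p : partials) {
            for (int j = 0; j < n; j++) {
                total[j] += p.sum[j];
            }
            count += p.rows;
        }
        for (int j = 0; j < n; j++) {
            total[j] /= count;
        }
        return total;
    }

    public static void main(String[] args) {
        // Two "reducers", each holding part of a 3 x 2 input.
        PartialColumnSum r1 = new PartialColumnSum(2);
        r1.add(new double[] {1.0, 2.0});
        r1.add(new double[] {3.0, 4.0});
        PartialColumnSum r2 = new PartialColumnSum(2);
        r2.add(new double[] {5.0, 6.0});
        System.out.println(Arrays.toString(columnMean(Arrays.asList(r1, r2))));
        // prints [3.0, 4.0]
    }
}
```

Each partial carries a row count as well as a sum because averaging the per-reducer averages would be wrong when reducers see different numbers of rows.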
>>
>>
>>
>>
>> On Wed, Jul 3, 2013 at 8:58 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> wrote:
>>
>> >
>> > On Jul 3, 2013 6:56 AM, "Chirag Lakhani" <clakhani@zaloni.com> wrote:
>> > >
>> > > So how does the column mean get calculated if the --pcaOffset option
>> > > is not specified?
>> > By taking the average of all row vectors. See the code for details.
>> >
>> > > I would think you are just doing SVD at that point.
>> > This statement is incorrect. I know because I designed this code.
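
To make the "average of all row vectors" step in the exchange above concrete, here is a minimal plain-Java sketch of computing the column mean and re-centering each row by it. This covers only the mean and re-centering step; as noted elsewhere in the thread, the PCA flow involves more than that, and nothing here reflects SSVD internals. The class CenterRows and its methods are hypothetical names for this example.

```java
import java.util.Arrays;

/** Hypothetical illustration of mean-centering a small dense matrix. */
public class CenterRows {

    /** Column mean: the average of all row vectors of a. */
    public static double[] columnMean(double[][] a) {
        double[] mean = new double[a[0].length];
        for (double[] row : a) {
            for (int j = 0; j < mean.length; j++) {
                mean[j] += row[j];
            }
        }
        for (int j = 0; j < mean.length; j++) {
            mean[j] /= a.length;
        }
        return mean;
    }

    /** Returns a copy of a with the given mean subtracted from every row. */
    public static double[][] center(double[][] a, double[] mean) {
        double[][] centered = new double[a.length][mean.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < mean.length; j++) {
                centered[i][j] = a[i][j] - mean[j];
            }
        }
        return centered;
    }

    public static void main(String[] args) {
        double[][] a = {{1.0, 10.0}, {3.0, 30.0}, {5.0, 50.0}};
        double[] mean = columnMean(a);
        System.out.println(Arrays.toString(mean));
        System.out.println(Arrays.deepToString(center(a, mean)));
    }
}
```

After centering, every column of the result sums to zero, which is the property that separates a PCA-style decomposition from an SVD of the raw input.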
>> >
>> > >
>> > >
>> > > On Tue, Jul 2, 2013 at 5:52 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> > >
>> > > > On Tue, Jul 2, 2013 at 1:52 PM, Chirag Lakhani <clakhani@zaloni.com> wrote:
>> > > >
>> > > > > Hello,
>> > > > >
>> > > > > I am trying to use the Mahout/Java API to do PCA but I am confused
>> > > > > about the right order to do things.  To start, I have a list of
>> > > > > DenseVectors that I am reading into the code and turning into a
>> > > > > distributed matrix in the following form.
>> > > > >
>> > > > >  DistributedRowMatrix m = new DistributedRowMatrix(input_vec,
>> > > > >      matrix_path, num_rows, num_cols);
>> > > > >
>> > > > > When I run this code, I would have thought it would output the
>> > > > > result into the path called "matrix_path" so that I can then use
>> > > > > something like MatrixColumnMeansJob.run to get the mean. When I run
>> > > > > this bit of code I get no output. Is there something else I should
>> > > > > do, or is there a better way to calculate the mean for my file?
>> > > > >
>> > > > >
>> > > > > From what I understand about the SSVD CLI code, you need to
>> > > > > calculate the column mean and then output it into a directory.
>> > > >
>> > > >
>> > > >
>> > > > No, you don't have to (although you have an _option_ to calculate
>> > > > and substitute one yourself if for some reason it is already known).
>> > > > By default, it is calculated for you.
>> > > >
>> > > >
>> > > >
>> > > > > Is there a good way to do this if I am starting from a file which
>> > > > > is a sequence file of DenseVectors?
>> > > > >
>> > > >
>> > > > Yes, just don't specify the --pcaOffset option.
>> > > >
>> > > >
>> > > > >
>> > > > > --
>> > > > >
>> > > > > *Chirag Lakhani*
>> > > > >
>> > > > > Data Scientist
>> > > > >
>> > > > > Zaloni, Inc. | www.zaloni.com
>> > > > >
>> > > > > 633 Davis Dr., Suite 200
>> > > > >
>> > > > > Durham, NC 27713
>> > > > > e: clakhani@zaloni.com
>> > > > > p: 919.602.4965 x7020
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> >
>> >
>>
>
>
>
>
>


