mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: DistributedRowMatrix transpose method problem
Date Sun, 12 Sep 2010 23:29:47 GMT
On Sun, Sep 12, 2010 at 3:37 PM, Abhijat Vatsyayan <
abhijat.vatsyayan@gmail.com> wrote:

> Thanks Jake. Wouldn't using FileSystem.globStatus(Path pathPattern,
> PathFilter filter) along with a filter for ignoring directories be easier?
> transpose() could then additionally set the filter in the transposed matrix
> (to ignore the directories). I have very little understanding/knowledge of
> the interface contracts, so I am not sure if it will break something else, but
> I will try a few things and see what works.
>

I'm sure you're right about that: a PathFilter for ignoring subdirectories
seems to be the right way to go when doing iterate()/iterateAll().  Those
are really pretty fragile, debugging-oriented methods, as you will typically
want to access all of the rows of a DistributedRowMatrix in an MR job, not
through a local iteration over the HDFS SequenceFile.

Feel free to open a JIRA ticket / post a patch. I'm sure we can get it
reviewed and committed quickly if we can verify it fixes this.

  -jake
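
The two ideas in this thread (a filter that skips directories, and the narrower "part*" glob) can be sketched without a Hadoop cluster. The block below is only a local-filesystem stand-in built on java.nio.file, not Mahout or Hadoop code: the class RowFileListing and its method names are hypothetical, and in Hadoop the same predicate would presumably live in a PathFilter handed to FileSystem.globStatus(Path, PathFilter).

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration: a local-filesystem analogue of the fix discussed
// in the thread. It does not use or modify DistributedRowMatrix.
final class RowFileListing {

    // Analogue of a directory-ignoring PathFilter: accept anything that is
    // not a directory, so a "_logs" subdirectory is skipped.
    static final DirectoryStream.Filter<Path> FILES_ONLY =
            entry -> !Files.isDirectory(entry);

    // Option 1: list every non-directory entry under dir.
    static List<String> listRowFiles(Path dir) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir, FILES_ONLY)) {
            return sortedNames(entries);
        }
    }

    // Option 2: restrict the glob itself, e.g. "part*" instead of "*".
    static List<String> listByGlob(Path dir, String glob) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir, glob)) {
            return sortedNames(entries);
        }
    }

    // Collects entry names and sorts them so the result order is stable.
    private static List<String> sortedNames(Iterable<Path> entries) {
        List<String> names = new ArrayList<>();
        for (Path entry : entries) {
            names.add(entry.getFileName().toString());
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Recreate the shape of a job output directory: part files plus "_logs".
        Path dir = Files.createTempDirectory("transpose-demo");
        Files.createFile(dir.resolve("part-00000"));
        Files.createFile(dir.resolve("part-00001"));
        Files.createDirectory(dir.resolve("_logs"));

        System.out.println("glob *      : " + listByGlob(dir, "*"));
        System.out.println("glob part*  : " + listByGlob(dir, "part*"));
        System.out.println("files only  : " + listRowFiles(dir));
    }
}
```

The directory filter is the more general of the two, since it does not depend on the output files being named part-*; the narrower glob is the smaller change if the naming convention can be relied on.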


> Abhijat
>
> On Sep 12, 2010, at 5:03 PM, Jake Mannix wrote:
>
> > Hi Abhijat,
> >
> >  It looks like you've found a bug not in transpose(), but in iterateAll()
> > (and probably iterate() ) - the file globbing of the contents of the
> > sequence file directory is grabbing the "_logs" subdirectory automatically
> > created by hadoop and trying to treat that as a part of the SequenceFile,
> > which it is not.
> >
> >  Yep, line 207 of DistributedRowMatrix globs together anything in the
> > matrix row directory, that "*" should be more restrictive (maybe you can
> > try "part*" and recompile and see if your code works?).
> >
> >  -jake
> >
> > On Sun, Sep 12, 2010 at 1:25 PM, Abhijat Vatsyayan <
> > abhijat.vatsyayan@gmail.com> wrote:
> >
> >> I isolated a bug in my program to a place where I am using
> >> DistributedRowMatrix.transpose(). When I send a "transpose" message to a
> >> DistributedRowMatrix object, I see the mapper and reducer being started, and
> >> the method finishes without any errors but my attempt to read the contents
> >> of the (transposed) matrix fails. Seems like I am missing something really
> >> basic here but any help will be appreciated.
> >>
> >> Here is the test case code (imports, package statement and comments not
> >> shown):
> >> public class TestMatrixIO {
> >>       @Test
> >>       public  void testDistributedTranspose( ) throws Exception
> >>       {
> >>               Configuration cfg = new Configuration( );
> >>               DistributedRowMatrix matrix = new DistributedRowMatrix(
> >>                       TestWriteMatrix.INPUT_TEST_MATRIX_FILE, "input/tmp_1", 3, 4);
> >>               matrix.configure(new JobConf(cfg));
> >>               int count = printMatrix(matrix); // prints OK ..
> >>               System.out.println("[testReadingDistributedMatrix()]..NumElements="+count);
> >>               DistributedRowMatrix matrix_t = matrix.transpose();
> >>               System.out.println("[testReadingDistributedMatrix()]..Transpose done");
> >>               printMatrix(matrix_t); // Fails
> >>       }
> >>       private static int printMatrix(DistributedRowMatrix matrix) {
> >>               Iterator<MatrixSlice> iterator = matrix.iterateAll();
> >>               int count = 0;
> >>               while(iterator.hasNext())
> >>               {
> >>                       MatrixSlice slice = iterator.next();
> >>                       Vector v = slice.vector();
> >>                       int size = v.size();
> >>                       for(int i=0;i<size;i++)
> >>                       {
> >>                               Element e = v.getElement(i);
> >>                               count++;
> >>                               System.out.print(e.get()+" ");
> >>                       }
> >>                       System.out.println();
> >>               }
> >>               return count;
> >>       }
> >> }
> >>
> >> The stack trace when I try to print the matrix on the last line of the
> >> testDistributedTranspose method is:
> >> java.lang.IllegalStateException: java.io.IOException: Cannot open filename /user/abhijat/input/transpose-104/_logs
> >>       at org.apache.mahout.math.hadoop.DistributedRowMatrix.iterateAll(DistributedRowMatrix.java:118)
> >>       at net.abhijat.hadoop.mr.testexec.TestMatrixIO.printMatrix(TestMatrixIO.java:28)
> >>       at net.abhijat.hadoop.mr.testexec.TestMatrixIO.testDistributedTranspose(TestMatrixIO.java:25)
> >>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>       at java.lang.reflect.Method.invoke(Method.java:597)
> >>       at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
> >>       at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
> >>       at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
> >>       at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
> >>       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
> >>       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
> >>       at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
> >>       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
> >>       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
> >>       at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
> >>       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
> >>       at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
> >>       at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:45)
> >>       at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
> >>       at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
> >>       at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
> >>       at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
> >>       at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
> >> Caused by: java.io.IOException: Cannot open filename /user/abhijat/input/transpose-104/_logs
> >>       at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
> >>       at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1488)
> >>       at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:376)
> >>       at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:178)
> >>       at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
> >>       at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
> >>       at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
> >>       at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
> >>       at org.apache.mahout.math.hadoop.DistributedRowMatrix$DistributedMatrixIterator.<init>(DistributedRowMatrix.java:216)
> >>       at org.apache.mahout.math.hadoop.DistributedRowMatrix.iterateAll(DistributedRowMatrix.java:116)
> >>       ... 24 more
> >>
> >>
> >> "hadoop fs -ls input" shows that the transpose job did create the directory
> >> and output files. I created the matrix file using the following code (imports,
> >> package statement and comments not shown):
> >> public class TestWriteMatrix {
> >>       public static final String INPUT_TEST_MATRIX_FILE = "input/test.matrix.file";
> >>       public static final double[][] matrix_dat =
> >>       {
> >>               {1,3,-2,0},
> >>               {2,3,2,-9},
> >>               {-1,1,-4,10}
> >>       };
> >>       @Test
> >>       public void testWritingMatrix() throws Exception
> >>       {
> >>               Configuration cfg = new Configuration( );
> >>               FileSystem fs = FileSystem.get(cfg);
> >>               SequenceFile.Writer writer = SequenceFile.createWriter(fs, cfg,
> >>                               new Path(INPUT_TEST_MATRIX_FILE),
> >>                               IntWritable.class, VectorWritable.class);
> >>               for(int i=0;i<matrix_dat.length;i++)
> >>               {
> >>                       DenseVector  row = new DenseVector(matrix_dat[i]);
> >>                       VectorWritable vwritable = new VectorWritable(row);
> >>                       writer.append(new IntWritable(i), vwritable);
> >>               }
> >>               writer.close();
> >>       }
> >> }
>
>
