nutch-dev mailing list archives

From "Lincoln Ritter" <linc...@lincolnritter.com>
Subject SegmentMerger "no input paths" problem and "special files/directories"
Date Wed, 11 Jun 2008 22:25:48 GMT
Greetings,

I'm running nutch trunk with the patch for hadoop 0.17 from NUTCH-634
(http://issues.apache.org/jira/browse/NUTCH-634)

I've run into a problem merging segments:

$ ./bin/nutch mergesegs crawl/segments_merge -dir crawl/segments/
08/06/11 14:32:35 INFO segment.SegmentMerger: Merging 3 segments to crawl/segments_merge/20080611143235
08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger:   adding hdfs://localhost:54310/user/lritter/crawl/segments/20080611135945
08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger:   adding hdfs://localhost:54310/user/lritter/crawl/segments/20080611141414
08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger:   adding hdfs://localhost:54310/user/lritter/crawl/segments/_logs
08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger: using segment data from:
java.io.IOException: No input paths specified in input
	at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:173)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:705)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
	at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:605)
	at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:648)

This looks to be the same (or similar) issue as:
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg10999.html

In my case, the merger seems to think that the '_logs' directory is
valid fodder for merging, which is clearly not the case.  For now, I
am assuming that underscore-prefixed names are "reserved" by nutch.
With that assumption, I can write a filter that screens them out.  I
have done this and attached a patch against trunk below.
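
To make the rule concrete, here is a minimal standalone sketch of the
name check on its own, without any Hadoop dependency.  The class and
method names (SegmentNameFilter, isNormalName, keepNormal) are made up
for illustration and are not part of nutch or of the patch; the rule
that a leading underscore marks a non-segment directory is the same
assumption described above, not something nutch is known to guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of nutch or of the patch below.
public class SegmentNameFilter {

    // Assumption: a name starting with "_" is "special" and is not a segment.
    static boolean isNormalName(String name) {
        return !name.startsWith("_");
    }

    // Keeps only the names that pass the underscore rule, preserving order.
    static List<String> keepNormal(List<String> names) {
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (isNormalName(name)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The three entries the merger picked up in the log above.
        List<String> listing =
            Arrays.asList("20080611135945", "20080611141414", "_logs");
        System.out.println(keepNormal(listing));
    }
}
```

The patch applies the same test inside a Hadoop PathFilter, combined
with a check that the path is a directory.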

While the patch fixes my immediate problem, it makes me a little
nervous that I am designating underscore-prefixed names as "special"
in a pretty ad hoc way. Is there any "real" way to determine whether or
not a directory contains segment information?

Thanks!

-lincoln

--
lincolnritter.com

--- PATCH ---

Index: src/java/org/apache/nutch/segment/SegmentMerger.java
===================================================================
--- src/java/org/apache/nutch/segment/SegmentMerger.java	(revision 666871)
+++ src/java/org/apache/nutch/segment/SegmentMerger.java	(working copy)
@@ -626,7 +626,7 @@
     boolean normalize = false;
     for (int i = 1; i < args.length; i++) {
       if (args[i].equals("-dir")) {
-        Path[] files = fs.listPaths(new Path(args[++i]), HadoopFSUtil.getPassDirectoriesFilter(fs));
+        Path[] files = fs.listPaths(new Path(args[++i]), HadoopFSUtil.getPassNormalDirectoriesFilter(fs));
         for (int j = 0; j < files.length; j++)
           segs.add(files[j]);
       } else if (args[i].equals("-filter")) {
Index: src/java/org/apache/nutch/util/HadoopFSUtil.java
===================================================================
--- src/java/org/apache/nutch/util/HadoopFSUtil.java	(revision 666871)
+++ src/java/org/apache/nutch/util/HadoopFSUtil.java	(working copy)
@@ -51,6 +51,23 @@

         };
     }
+
+    /**
+     * Returns PathFilter that passes directories that are not "special" through.
+     */
+    public static PathFilter getPassNormalDirectoriesFilter(final FileSystem fs) {
+        return new PathFilter() {
+            public boolean accept(final Path path) {
+                try {
+                    FileStatus status = fs.getFileStatus(path);
+                    return status.isDir() && !status.getPath().getName().startsWith("_");
+                } catch (IOException ioe) {
+                    return false;
+                }
+            }
+
+        };
+    }

     /**
      * Turns an array of FileStatus into an array of Paths.

--- END PATCH ---
