nutch-dev mailing list archives

From David Podunavac <david.poduna...@wyona.com>
Subject reading crawl dir from nutch-default.xml
Date Fri, 25 Aug 2006 14:26:29 GMT
Hi

I think this patch will make Nutch much easier to configure: the crawl dir
will be read from nutch-default.xml instead of being a relative path from
wherever the command was executed.
So nutch-default.xml will have its
<property>
  <name>searcher.dir</name>
  <value>PATH_TO_CRAWL_DIR</value>
  <description>
and this value will be used instead:

Index: nutch-0.8/src/java/org/apache/nutch/crawl/Crawl.java
===================================================================
--- nutch-0.8/src/java/org/apache/nutch/crawl/Crawl.java       (revision 436809)
+++ nutch-0.8/src/java/org/apache/nutch/crawl/Crawl.java       (working copy)
@@ -53,10 +53,12 @@

     Configuration conf = NutchConfiguration.create();
     conf.addDefaultResource("crawl-tool.xml");
+    conf.addDefaultResource("nutch-default.xml");
     JobConf job = new NutchJob(conf);

     Path rootUrlDir = null;
-    Path dir = new Path("crawl-" + getDate());
+    String path2crawlDir = conf.get("searcher.dir");
+    Path dir = new Path(path2crawlDir);
     int threads = job.getInt("fetcher.threads.fetch", 10);
     int depth = 5;
     int topN = Integer.MAX_VALUE;
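
The hunk above does not say what happens when searcher.dir is unset; conf.get then returns null, and building the Path from it would likely fail. A minimal sketch of a fallback follows. It assumes java.util.Properties as a stand-in for the Hadoop Configuration object so that it runs without Nutch on the classpath; the names CrawlDirResolver and resolveCrawlDir and the timestamp pattern are hypothetical and not part of the patch.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;

// Hypothetical helper, not part of Nutch: picks the crawl directory from
// configuration and falls back to a dated "crawl-..." name when it is unset.
final class CrawlDirResolver {

    private CrawlDirResolver() {
    }

    // conf stands in for the Hadoop Configuration (assumption for this sketch).
    static String resolveCrawlDir(Properties conf, Date now) {
        String configured = conf.getProperty("searcher.dir");
        if (configured == null || configured.trim().isEmpty()) {
            // Assumed timestamp pattern; mirrors the idea of "crawl-" + getDate().
            return "crawl-" + new SimpleDateFormat("yyyyMMddHHmmss").format(now);
        }
        return configured.trim();
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("searcher.dir", "/data/crawl");
        System.out.println(resolveCrawlDir(conf, new Date()));
        System.out.println(resolveCrawlDir(new Properties(), new Date()));
    }
}
```

With a fallback like this, setups that never configure searcher.dir would keep the old dated-directory behaviour.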


This second patch makes CrawlDbReader find that crawl directory:

Index: nutch-0.8/src/java/org/apache/nutch/crawl/CrawlDbReader.java
===================================================================
--- nutch-0.8/src/java/org/apache/nutch/crawl/CrawlDbReader.java       (revision 436809)
+++ nutch-0.8/src/java/org/apache/nutch/crawl/CrawlDbReader.java       (working copy)
@@ -406,8 +406,10 @@
       return;
     }
     String param = null;
-    String crawlDb = args[0];
+    //String crawlDb = args[0];
     Configuration conf = NutchConfiguration.create();
+    conf.addDefaultResource("nutch-default.xml");
+    String crawlDb = conf.get("searcher.dir") + "/crawldb";
     for (int i = 1; i < args.length; i++) {
       if (args[i].equals("-stats")) {
         dbr.processStatJob(crawlDb, conf);
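
A possible variation on the hunk above, sketched under assumptions: keep an explicit crawldb argument as an override and derive the path from searcher.dir only when none is given. As before, java.util.Properties stands in for the Hadoop Configuration, and CrawlDbLocator and locateCrawlDb are hypothetical names that do not exist in Nutch.

```java
import java.util.Properties;

// Hypothetical helper, not part of Nutch: decides which crawldb path to read.
final class CrawlDbLocator {

    private CrawlDbLocator() {
    }

    // conf stands in for the Hadoop Configuration (assumption for this sketch).
    static String locateCrawlDb(String[] args, Properties conf) {
        // A first argument that is not an option (e.g. "-stats") is treated
        // as an explicit crawldb path and wins over the configuration.
        if (args.length > 0 && !args[0].startsWith("-")) {
            return args[0];
        }
        String root = conf.getProperty("searcher.dir");
        if (root == null || root.isEmpty()) {
            throw new IllegalStateException(
                "searcher.dir is not set and no crawldb argument was given");
        }
        // Avoid a doubled separator when the configured value ends with "/".
        return root.endsWith("/") ? root + "crawldb" : root + "/crawldb";
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("searcher.dir", "/data/crawl");
        System.out.println(locateCrawlDb(new String[] {"-stats"}, conf));
    }
}
```

This keeps existing command lines that pass the crawldb explicitly working, while letting the configured directory act as the default.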



WDYT

thanks

David
