nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "WhiteListRobots" by ChrisMattmann
Date Wed, 15 Apr 2015 22:47:34 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "WhiteListRobots" page has been changed by ChrisMattmann:
https://wiki.apache.org/nutch/WhiteListRobots?action=diff&rev1=2&rev2=3

  
  Nutch now has a [[https://issues.apache.org/jira/browse/NUTCH-1927|white list for robots.txt]]
capability that can be used to turn robots.txt parsing on or off selectively, on a per-host
and/or per-IP basis. Read on to find out how to use it.
  
- = List hostnames and/or IP addresses in Nutch conf = 
+ == List hostnames and/or IP addresses in Nutch conf ==
  
  In the Nutch configuration directory (conf/), edit nutch-default.xml (and/or nutch-site.xml)
and add the following information:
  
@@ -28, +28 @@

  </property>
  }}}
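For reference, NUTCH-1927 controls the whitelist through a single configuration property. A minimal sketch of such an entry (the hostnames and IP below are placeholders, not values from this page) might look like:

```xml
<!-- Sketch of a whitelist entry for nutch-site.xml.
     http.robot.rules.whitelist is the property introduced by NUTCH-1927;
     the hostnames and IP address are placeholders. -->
<property>
  <name>http.robot.rules.whitelist</name>
  <value>somehost.example.com,127.0.0.1</value>
  <description>Comma-separated list of hostnames or IP addresses for which
  robots.txt parsing is skipped. Use with care, and only for hosts you
  control or have explicit permission to crawl.</description>
</property>
```

Hosts not listed in the value continue to be subject to normal robots.txt parsing.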
  
- = Testing the configuration =
+ == Testing the configuration ==
  
  Create a sample URLs file to test your whitelist. For example, create a file called "url"
(without the quotes) and list one URL per line:
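As a sketch, a seed file of this shape could contain (these URLs are placeholders, not values from this page):

```
http://somehost.example.com/
http://anotherhost.example.com/page.html
```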
  
@@ -44, +44 @@

  Disallow: /
  }}}
  
- = Build the Nutch runtime and execute RobotRulesParser =
+ == Build the Nutch runtime and execute RobotRulesParser ==
  
  Now, build the Nutch runtime, e.g., by running ```ant runtime```.
From your ```runtime/local/``` directory, run this command:
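The command itself is elided in this diff; as a sketch, an invocation of RobotRulesParser could look like the following (the file names and agent name are placeholders, and the exact argument order should be checked against the class's usage message):

```shell
# Sketch: check robots.txt handling for the seed URLs, honoring the whitelist.
# "robots.txt" is a local copy of the rules file, "url" is the seed file
# created above, and "MyCrawler" is a placeholder agent name.
bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt url MyCrawler
```

URLs whose host appears in the whitelist should be reported as allowed even when the robots.txt disallows them.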
