Giuseppe Totaro created NUTCH-1995:
--------------------------------------
Summary: Add support for wildcard to http.robot.rules.whitelist
Key: NUTCH-1995
URL: https://issues.apache.org/jira/browse/NUTCH-1995
Project: Nutch
Issue Type: Improvement
Components: robots
Affects Versions: 1.10
Reporter: Giuseppe Totaro
The {{http.robot.rules.whitelist}} configuration parameter lets you specify a comma-separated
list of hostnames or IP addresses for which robot rules parsing is ignored.
Supporting wildcards in {{http.robot.rules.whitelist}} would simplify the configuration
when many hostnames or addresses need to be listed. For example:
{noformat}
<property>
<name>http.robot.rules.whitelist</name>
<value>*.sample.com</value>
<description>Comma separated list of hostnames or IP addresses to ignore
robot rules parsing for. Use with care and only if you are explicitly
allowed by the site owner to ignore the site's robots.txt!
</description>
</property>
{noformat}
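As a rough illustration of the matching this would need (a sketch only, not a patch and not the existing Nutch code), each whitelist entry could be turned into a regular expression in which {{*}} matches any run of characters. The class name {{WildcardWhitelist}} and the method {{isWhitelisted}} are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Nutch.
class WildcardWhitelist {
  private final List<Pattern> patterns = new ArrayList<>();

  // Parses a comma-separated list such as "*.sample.com, 192.168.*".
  WildcardWhitelist(String commaSeparated) {
    for (String entry : commaSeparated.split(",")) {
      String trimmed = entry.trim().toLowerCase(Locale.ROOT);
      if (!trimmed.isEmpty()) {
        patterns.add(toPattern(trimmed));
      }
    }
  }

  // Quotes every literal piece and joins the pieces with ".*",
  // so '*' is the only character with special meaning.
  private static Pattern toPattern(String glob) {
    String[] parts = glob.split("\\*", -1);
    StringBuilder regex = new StringBuilder();
    for (int i = 0; i < parts.length; i++) {
      if (i > 0) {
        regex.append(".*");
      }
      regex.append(Pattern.quote(parts[i]));
    }
    return Pattern.compile(regex.toString());
  }

  // True if the whole host name or address matches any entry.
  boolean isWhitelisted(String host) {
    String normalized = host.toLowerCase(Locale.ROOT);
    for (Pattern p : patterns) {
      if (p.matcher(normalized).matches()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    WildcardWhitelist whitelist = new WildcardWhitelist("*.sample.com");
    System.out.println(whitelist.isWhitelisted("www.sample.com")); // true
    System.out.println(whitelist.isWhitelisted("sample.com"));     // false
  }
}
```

Under this sketch, {{*.sample.com}} covers subdomains at any depth but not the bare {{sample.com}}, which would still have to be listed explicitly. Whether the wildcard should also cover the bare domain is a design question left open here.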
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)