nutch-dev mailing list archives

From "Giuseppe Totaro (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-1995) Add support for wildcard to http.robot.rules.whitelist
Date Wed, 22 Apr 2015 06:36:58 GMT
Giuseppe Totaro created NUTCH-1995:
--------------------------------------

             Summary: Add support for wildcard to http.robot.rules.whitelist
                 Key: NUTCH-1995
                 URL: https://issues.apache.org/jira/browse/NUTCH-1995
             Project: Nutch
          Issue Type: Improvement
          Components: robots
    Affects Versions: 1.10
            Reporter: Giuseppe Totaro


The {{http.robot.rules.whitelist}} configuration parameter takes a comma-separated
list of hostnames or IP addresses for which robot rules parsing is ignored.
Supporting wildcards in {{http.robot.rules.whitelist}} would simplify the
configuration when many hostnames or addresses have to be listed. For example:
{noformat}
<property>
  <name>http.robot.rules.whitelist</name>
  <value>*.sample.com</value>
  <description>Comma separated list of hostnames or IP addresses to ignore 
  robot rules parsing for. Use with care and only if you are explicitly
  allowed by the site owner to ignore the site's robots.txt!
  </description>
</property>
{noformat}
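
To make the intended matching behaviour concrete, here is a minimal sketch in Java of how a {{*.sample.com}} entry could be evaluated. It is not the Nutch implementation: the class {{WildcardWhitelist}} and its method {{isWhitelisted}} are hypothetical names, and it assumes that a wildcard is only allowed as a leading {{*.}} label and that {{*.sample.com}} matches subdomains but not the bare {{sample.com}}. Other semantics (for example, full glob or regex matching) would be equally possible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/**
 * Hypothetical helper (not part of Nutch): checks a host against a
 * comma-separated whitelist whose entries may start with "*.".
 */
class WildcardWhitelist {
    // Entries without a wildcard, compared for exact equality.
    private final List<String> exact = new ArrayList<>();
    // Wildcard entries stored as suffixes with the leading dot, e.g. ".sample.com".
    private final List<String> suffixes = new ArrayList<>();

    WildcardWhitelist(String commaSeparated) {
        for (String raw : commaSeparated.split(",")) {
            String entry = raw.trim().toLowerCase(Locale.ROOT);
            if (entry.isEmpty()) {
                continue; // skip empty items such as "a,,b"
            }
            if (entry.startsWith("*.")) {
                suffixes.add(entry.substring(1)); // drop "*", keep the dot
            } else {
                exact.add(entry);
            }
        }
    }

    /** Returns true if the host equals an exact entry or ends with a wildcard suffix. */
    boolean isWhitelisted(String host) {
        if (host == null) {
            return false;
        }
        String h = host.toLowerCase(Locale.ROOT);
        if (exact.contains(h)) {
            return true;
        }
        for (String suffix : suffixes) {
            // Assumption: at least one character must precede the suffix,
            // so "*.sample.com" does not match ".sample.com" itself.
            if (h.length() > suffix.length() && h.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        WildcardWhitelist wl = new WildcardWhitelist("*.sample.com, 192.168.0.1");
        System.out.println(wl.isWhitelisted("www.sample.com"));
        System.out.println(wl.isWhitelisted("sample.com"));
        System.out.println(wl.isWhitelisted("192.168.0.1"));
    }
}
```

Keeping the leading dot in the stored suffix means a host such as {{notsample.com}} is not matched by {{*.sample.com}}. Whether the bare domain should also match is a design decision this sketch leaves to the actual patch.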



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
