nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <>
Subject [jira] [Created] (NUTCH-1455) RobotRulesParser to match multi-word user-agent names
Date Tue, 14 Aug 2012 22:02:37 GMT
Sebastian Nagel created NUTCH-1455:

             Summary: RobotRulesParser to match multi-word user-agent names
                 Key: NUTCH-1455
             Project: Nutch
          Issue Type: Bug
          Components: protocol
    Affects Versions: 1.5.1
            Reporter: Sebastian Nagel

If a user-agent name configured in http.robots.agents contains spaces, it is not matched,
even if it appears verbatim in the robots.txt.

http.robots.agents = "Download Ninja,*"

If the robots.txt contains
User-agent: Download Ninja
Disallow: /
all content should be forbidden. But it isn't:
% curl '' > robots.txt
% grep -A1 -i ninja robots.txt 
User-agent: Download Ninja
Disallow: /
% cat test.urls
% bin/nutch plugin lib-http org.apache.nutch.protocol.http.api.RobotRulesParser robots.txt test.urls 'Download Ninja'

The RFC states that
bq. The robot must obey the first record in /robots.txt that contains a User-Agent line whose
value contains the name token of the robot as a substring.
Assuming that "Download Ninja" is a substring of itself, it should match and
all content should be forbidden.
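
The quoted rule can be sketched as follows. This is a minimal illustration in Python, not Nutch code: the record layout and the function name first_matching_record are hypothetical, and the case-insensitive comparison is an assumption made for the example.

```python
def first_matching_record(records, robot_name):
    """Return the first record with a User-agent value containing robot_name.

    `records` is a hypothetical, already-parsed robots.txt: a list of dicts
    with the keys "agents" (User-agent values) and "disallow" (path prefixes).
    """
    needle = robot_name.lower()  # assumption: names compare case-insensitively
    for record in records:
        for agent in record["agents"]:
            # Substring test on the whole User-agent value, spaces included.
            if needle in agent.lower():
                return record
    return None


records = [
    {"agents": ["Download Ninja"], "disallow": ["/"]},
    {"agents": ["*"], "disallow": []},
]

match = first_matching_record(records, "Download Ninja")
print(match["disallow"])  # ['/']
```

Under this reading, the multi-word name matches its own record and every path is disallowed, which is the behavior the report expects.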

The point is that the agent name from the User-Agent line is split at spaces while the names
from the http.robots.agents property are not (they are only split at ",").
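
The mismatch can be reproduced with a small sketch. It is written in Python for brevity and is not the actual RobotRulesParser code: the function names are hypothetical, and the lowercasing and the equality comparison are assumptions chosen to mirror the splitting behavior described above.

```python
def configured_agents(property_value):
    """Split the http.robots.agents value at "," only (spaces are kept)."""
    return [name.strip().lower() for name in property_value.split(",")]


def user_agent_tokens(line_value):
    """Split the value of a robots.txt User-agent line at whitespace."""
    return [token.lower() for token in line_value.split()]


agents = configured_agents("Download Ninja,*")  # ['download ninja', '*']
tokens = user_agent_tokens("Download Ninja")    # ['download', 'ninja']

# The configured name still contains a space, so it never equals a single token.
matched = [agent for agent in agents if agent in tokens]
print(matched)  # []
```

Comparing the configured name against the unsplit User-agent value, or splitting both sides the same way, would make the two forms comparable; which of these suits Nutch is left open here.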
