nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-1455) RobotRulesParser to match multi-word user-agent names
Date Tue, 14 Aug 2012 22:02:37 GMT
Sebastian Nagel created NUTCH-1455:
--------------------------------------

             Summary: RobotRulesParser to match multi-word user-agent names
                 Key: NUTCH-1455
                 URL: https://issues.apache.org/jira/browse/NUTCH-1455
             Project: Nutch
          Issue Type: Bug
          Components: protocol
    Affects Versions: 1.5.1
            Reporter: Sebastian Nagel


If a user-agent name configured in http.robots.agents contains spaces, it is not matched
even if it is exactly contained in the robots.txt.

http.robots.agents = "Download Ninja,*"

If the robots.txt (http://en.wikipedia.org/robots.txt) contains
{code}
User-agent: Download Ninja
Disallow: /
{code}
all content should be forbidden. But it isn't:
{code}
% curl 'http://en.wikipedia.org/robots.txt' > robots.txt
% grep -A1 -i ninja robots.txt 
User-agent: Download Ninja
Disallow: /
% cat test.urls
http://en.wikipedia.org/
% bin/nutch plugin lib-http org.apache.nutch.protocol.http.api.RobotRulesParser robots.txt test.urls 'Download Ninja'
...
allowed:        http://en.wikipedia.org/
{code}

The RFC (http://www.robotstxt.org/norobots-rfc.txt) states that
bq. The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring.
Assuming that "Download Ninja" is a substring of itself, it should match and http://en.wikipedia.org/
should be forbidden.

The point is that the agent name from the User-Agent line is split at spaces while the names
from the http.robots.agents property are not (they are only split at ",").
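
The mismatch can be illustrated with a minimal Java sketch. This is not the actual RobotRulesParser code: the class AgentMatchSketch and all of its method names are hypothetical, and the lower-casing of names is an assumption made for the example. It only contrasts the two behaviours described above: splitting the User-agent value at spaces versus checking the configured name as a substring of the whole value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not taken from Nutch.
public class AgentMatchSketch {

    // Names from the http.robots.agents property are split only at ",",
    // so "Download Ninja" stays one name. The wildcard "*" is skipped here
    // because the sketch only looks at named agents.
    static List<String> configuredNames(String property) {
        List<String> names = new ArrayList<>();
        for (String name : property.split(",")) {
            String trimmed = name.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty() && !trimmed.equals("*")) {
                names.add(trimmed);
            }
        }
        return names;
    }

    // Behaviour described in the report: the User-agent value is split at
    // spaces and each token is compared with the configured names.
    static boolean matchesSplitAtSpaces(String property, String userAgentValue) {
        List<String> names = configuredNames(property);
        String value = userAgentValue.trim().toLowerCase(Locale.ROOT);
        for (String token : value.split("\\s+")) {
            if (names.contains(token)) {
                return true;
            }
        }
        return false;
    }

    // Rule quoted from the RFC text: the User-agent value must contain the
    // robot's name as a substring.
    static boolean matchesAsSubstring(String property, String userAgentValue) {
        String value = userAgentValue.trim().toLowerCase(Locale.ROOT);
        for (String name : configuredNames(property)) {
            if (value.contains(name)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String property = "Download Ninja,*";
        String line = "Download Ninja";
        System.out.println(matchesSplitAtSpaces(property, line));
        System.out.println(matchesAsSubstring(property, line));
    }
}
```

With the property value and the robots.txt line from this report, the token comparison finds no match because neither "download" nor "ninja" equals the configured name "download ninja", while the substring check does match.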

