nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <>
Subject [jira] [Created] (NUTCH-1752) cache robots.txt rules per protocol:host:port
Date Wed, 09 Apr 2014 09:01:31 GMT
Sebastian Nagel created NUTCH-1752:

             Summary: cache robots.txt rules per protocol:host:port
                 Key: NUTCH-1752
             Project: Nutch
          Issue Type: Bug
          Components: protocol
    Affects Versions: 2.2.1, 1.8
            Reporter: Sebastian Nagel
             Fix For: 2.3, 1.9

HttpRobotRulesParser caches rules from {{robots.txt}} per "protocol:host" (before NUTCH-1031
caching was per "host" only). The caching should be per "protocol:host:port": if in doubt, a
request to a different port may deliver a different {{robots.txt}}. 
Applying robots.txt rules per combination of protocol, host, and port is common practice:
the [Norobots RFC 1996 draft|] does not state this
explicitly (though it can be derived from its examples), but other sources do:
* [Wikipedia|]: "each protocol and port needs its own
robots.txt file"
* [Google webmasters|]:
"The directives listed in the robots.txt file apply only to the host, protocol and port number
where the file is hosted."
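A minimal sketch of the proposed cache key, including the port (with a fallback to the protocol's default port when the URL omits it). The class and method names here are illustrative, not Nutch's actual implementation:

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch of a robots.txt cache key per NUTCH-1752;
// not the actual HttpRobotRulesParser code.
public class RobotsCacheKey {

  /** Build a "protocol:host:port" cache key. When the URL carries no
   *  explicit port, fall back to the protocol's default port so that
   *  e.g. http://example.com/ and http://example.com:80/ share one entry. */
  public static String getCacheKey(URL url) {
    String protocol = url.getProtocol().toLowerCase();
    String host = url.getHost().toLowerCase();
    int port = url.getPort();
    if (port == -1) {
      port = url.getDefaultPort(); // 80 for http, 443 for https, ...
    }
    return protocol + ":" + host + ":" + port;
  }

  public static void main(String[] args) throws MalformedURLException {
    // Same host, different ports: distinct cache entries, distinct robots.txt.
    System.out.println(getCacheKey(new URL("http://example.com/robots.txt")));
    System.out.println(getCacheKey(new URL("http://example.com:8080/robots.txt")));
  }
}
```

With such a key, fetches against `example.com:80` and `example.com:8080` no longer share cached rules, matching the per-port behavior the cited sources describe.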

This message was sent by Atlassian JIRA
