nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <>
Subject [jira] [Updated] (NUTCH-1752) cache robots.txt rules per protocol:host:port
Date Fri, 25 Apr 2014 21:39:14 GMT


Sebastian Nagel updated NUTCH-1752:

    Attachment: NUTCH-1752-v2.patch

Attached reviewed patch v2. Changed/fixed the caching of robots rules for a redirected robots.txt:
* patch v1 introduced a bug: the cache key for the redirect target was not constructed properly
* for the redirect target's cache key, use the protocol and port of the redirect target. E.g., if https://host1/robots.txt
redirects to http://host2/robots.txt, the rules from the latter are cached both for "https:host1:443"
and for "http:host2:80".
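The "protocol:host:port" cache-key scheme described above can be sketched as follows. This is a hypothetical illustration, not the actual HttpRobotRulesParser code; `getCacheKey` and `defaultPort` are made-up helper names, and the default ports 80 (http) and 443 (https) are assumed when the URL does not specify one:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class RobotsCacheKey {

    // Assumed defaults when the URL carries no explicit port.
    static int defaultPort(String protocol) {
        return "https".equals(protocol) ? 443 : 80;
    }

    // Build a "protocol:host:port" cache key for robots.txt rules
    // (hypothetical sketch of the scheme, not Nutch's implementation).
    static String getCacheKey(URL url) {
        int port = (url.getPort() != -1) ? url.getPort()
                                         : defaultPort(url.getProtocol());
        return url.getProtocol() + ":" + url.getHost() + ":" + port;
    }

    public static void main(String[] args) throws MalformedURLException {
        // Per the redirect example above, the fetched rules would be
        // cached under both of these keys:
        System.out.println(getCacheKey(new URL("https://host1/robots.txt"))); // https:host1:443
        System.out.println(getCacheKey(new URL("http://host2/robots.txt")));  // http:host2:80
    }
}
```

With such a key, requests to the same host on different ports (e.g. http://host3/ vs. http://host3:8080/) no longer share cached rules, which is the point of the fix.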

> cache robots.txt rules per protocol:host:port
> ---------------------------------------------
>                 Key: NUTCH-1752
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.8, 2.2.1
>            Reporter: Sebastian Nagel
>             Fix For: 2.3, 1.9
>         Attachments: NUTCH-1752-v1.patch, NUTCH-1752-v2.patch
> HttpRobotRulesParser caches rules from {{robots.txt}} per "protocol:host" (before NUTCH-1031
> caching was per "host" only). The caching should be per "protocol:host:port": if in doubt, a
> request to a different port may deliver a different {{robots.txt}}.
> Applying robots.txt rules per combination of host, protocol, and port is common practice:
> the [Norobots RFC 1996 draft|] does not mention this explicitly (it could be derived from
> the examples), but others do:
> * [Wikipedia|]: "each protocol and port needs its own robots.txt file"
> * [Google webmasters|]: "The directives listed in the robots.txt file apply only to the
> host, protocol and port number where the file is hosted."

This message was sent by Atlassian JIRA
