nutch-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1752) cache robots.txt rules per protocol:host:port
Date Sun, 18 May 2014 03:45:16 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000978#comment-14000978 ]

Hudson commented on NUTCH-1752:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #2630 (See [https://builds.apache.org/job/Nutch-trunk/2630/])
NUTCH-1752 Cache robots.txt rules per protocol:host:port (snagel: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1594071)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java


> cache robots.txt rules per protocol:host:port
> ---------------------------------------------
>
>                 Key: NUTCH-1752
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1752
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.8, 2.2.1
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>             Fix For: 2.3, 1.9
>
>         Attachments: NUTCH-1752-v1.patch, NUTCH-1752-v2.patch
>
>
> HttpRobotRulesParser caches rules from {{robots.txt}} per "protocol:host" (before NUTCH-1031,
> caching was per "host" only). The caching should be per "protocol:host:port", because a
> request to a different port may deliver a different {{robots.txt}}.
> Applying robots.txt rules to a combination of host, protocol, and port is common practice:
> [Norobots RFC 1996 draft|http://www.robotstxt.org/norobots-rfc.txt] does not mention
> this explicitly (could be derived from examples) but others do:
> * [Wikipedia|http://en.wikipedia.org/wiki/Robots.txt]: "each protocol and port needs
> its own robots.txt file"
> * [Google webmasters|https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt]:
> "The directives listed in the robots.txt file apply only to the host, protocol and port number
> where the file is hosted."
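
The keying scheme described above can be sketched as follows. This is a minimal illustration, not the code from the attached patches: the class name RobotsCacheKey and both cacheKey methods are hypothetical, and the choice to substitute the protocol's default port when the URL does not name one is an assumption made so that "http://example.com/" and "http://example.com:80/" share one cache entry.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Locale;

// Hypothetical illustration; not taken from the NUTCH-1752 patches.
class RobotsCacheKey {

  // Builds a "protocol:host:port" key for the given URL.
  static String cacheKey(URL url) {
    String protocol = url.getProtocol().toLowerCase(Locale.ROOT);
    String host = url.getHost().toLowerCase(Locale.ROOT);
    int port = url.getPort();
    if (port == -1) {
      // No explicit port in the URL: fall back to the protocol default
      // (an assumption of this sketch), e.g. 80 for http, 443 for https.
      port = url.getDefaultPort();
    }
    return protocol + ":" + host + ":" + port;
  }

  // Convenience overload for string input; rejects malformed URLs.
  static String cacheKey(String url) {
    try {
      return cacheKey(URI.create(url).toURL());
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("Not a valid URL: " + url, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(cacheKey("http://example.com/a.html"));      // http:example.com:80
    System.out.println(cacheKey("https://example.com/a.html"));     // https:example.com:443
    System.out.println(cacheKey("http://example.com:8080/a.html")); // http:example.com:8080
  }
}
```

With a key of this shape, rules fetched from port 8080 are never applied to requests on port 80 of the same host, which is the behavior the issue asks for.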



--
This message was sent by Atlassian JIRA
(v6.2#6252)
