nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2581) Caching of redirected robots.txt may overwrite correct robots.txt rules
Date Fri, 08 Jun 2018 09:51:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16505879#comment-16505879 ]

ASF GitHub Bot commented on NUTCH-2581:
---------------------------------------

sebastian-nagel closed pull request #342: NUTCH-2581 Caching of redirected robots.txt may
overwrite correct robots.txt rules (fix for 2.x)
URL: https://github.com/apache/nutch/pull/342
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
index 2af6fa577..08ec39f7c 100644
--- a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
+++ b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
@@ -138,8 +138,10 @@ else if (response.getCode() >= 500) {
 
       if (cacheRule) {
         CACHE.put(cacheKey, robotRules); // cache rules for host
-        if (redir != null && !redir.getHost().equalsIgnoreCase(url.getHost())) {
+        if (redir != null && !redir.getHost().equalsIgnoreCase(url.getHost())
+            && "/robots.txt".equals(redir.getFile())) {
           // cache also for the redirected host
+          // if the URL path is /robots.txt
           CACHE.put(getCacheKey(redir), robotRules);
         }
       }


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Caching of redirected robots.txt may overwrite correct robots.txt rules
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-2581
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2581
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, robots
>    Affects Versions: 2.3.1, 1.14
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Critical
>             Fix For: 2.4, 1.15
>
>
> Redirected robots.txt rules are also cached for the redirect target host. This may cause
> the correct robots.txt rules to never be fetched. E.g., http://wyomingtheband.com/robots.txt
> redirects to https://www.facebook.com/wyomingtheband/robots.txt. Because fetching the redirect
> target fails with a 404, bots are allowed to crawl wyomingtheband.com. These rules are
> erroneously also cached for the redirect target host www.facebook.com, whose own
> [robots.txt|https://www.facebook.com/robots.txt] is explicit and does not allow crawling.
> Nutch should cache redirected robots.txt rules for the target host only if the path part
> (in doubt, including the query) of the redirect target URL is exactly {{/robots.txt}}.
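
As a minimal standalone sketch (not part of the Nutch patch; the class name and the
example.org redirect are made up for illustration), the check added by the fix can be
expressed with java.net.URL, whose getFile() returns the path plus any query string:

import java.net.URL;

// Minimal standalone sketch of the path check introduced by the patch
// (not Nutch code; class name and the example.org URL are illustrative only).
public class RobotsRedirectCheck {

  // Cache the fetched rules for the redirect target host only if the target
  // is a different host and its path (plus query) is exactly /robots.txt.
  static boolean cacheForRedirectTarget(URL url, URL redir) {
    return redir != null
        && !redir.getHost().equalsIgnoreCase(url.getHost())
        && "/robots.txt".equals(redir.getFile());
  }

  public static void main(String[] args) throws Exception {
    URL url = new URL("http://wyomingtheband.com/robots.txt");

    // Redirect target from the example above: getFile() returns
    // "/wyomingtheband/robots.txt", so nothing is cached for www.facebook.com.
    URL badRedir = new URL("https://www.facebook.com/wyomingtheband/robots.txt");

    // A redirect that only changes host/scheme keeps the path /robots.txt,
    // so caching for the target host is still allowed.
    URL okRedir = new URL("https://www.example.org/robots.txt");

    System.out.println(cacheForRedirectTarget(url, badRedir)); // false
    System.out.println(cacheForRedirectTarget(url, okRedir));  // true
  }
}

For the redirect in the example, getFile() is /wyomingtheband/robots.txt, so with the
patch applied the rules fetched via the redirect are no longer cached for www.facebook.com.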



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
