nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-753) Prevent new Fetcher to retrieve the robots twice
Date Mon, 07 Sep 2009 18:33:57 GMT
Prevent new Fetcher to retrieve the robots twice
------------------------------------------------

                 Key: NUTCH-753
                 URL: https://issues.apache.org/jira/browse/NUTCH-753
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 1.0.0
            Reporter: Julien Nioche


The new Fetcher, which is now used by default, handles the robots file directly instead of relying
on the protocol. The options Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS are set to
false to prevent robots.txt from being fetched twice (once in the Fetcher and once in the protocol),
which avoids the call to robots.isAllowed in the protocol. However, in practice the robots file is
still fetched, because a call to robots.getCrawlDelay() a bit further down is not covered by the
if (Protocol.CHECK_ROBOTS) guard.
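
To make the unguarded call concrete, below is a minimal, self-contained Java sketch of the
pattern described above. It is an illustration only: FakeRobotRules, the fetch counter and the
host/path arguments are invented for this example and are not Nutch classes or the actual
lib-http code. The point is simply that getCrawlDelay() runs outside the
if (Protocol.CHECK_ROBOTS) guard, so robots.txt is still retrieved when the guard is disabled.

    public class RobotsGuardSketch {

        /** Stand-in for the protocol's robots handling; not a Nutch class. */
        static class FakeRobotRules {
            int fetches = 0; // counts simulated robots.txt downloads

            private void fetchRobotsTxt(String host) {
                fetches++; // pretend we downloaded http://<host>/robots.txt here
            }

            boolean isAllowed(String host, String path) {
                fetchRobotsTxt(host); // needs robots.txt to answer
                return true;
            }

            long getCrawlDelay(String host) {
                fetchRobotsTxt(host); // also needs robots.txt to answer
                return 0L;
            }
        }

        public static void main(String[] args) {
            FakeRobotRules robots = new FakeRobotRules();
            boolean checkRobots = false; // the Fetcher sets Protocol.CHECK_ROBOTS to false

            if (checkRobots) {
                robots.isAllowed("example.org", "/"); // skipped, as intended
            }

            // Mirrors the unguarded getCrawlDelay() call described in the report:
            long delay = robots.getCrawlDelay("example.org");

            // Prints "robots.txt fetches: 1" -- the file was still retrieved
            // even though CHECK_ROBOTS was false.
            System.out.println("robots.txt fetches: " + robots.fetches
                    + ", crawl delay: " + delay);
        }
    }

A fix along the lines of the issue title would presumably place the getCrawlDelay() call (or its
equivalent) under the same CHECK_ROBOTS guard, so the protocol never touches robots.txt when the
Fetcher has already handled it.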


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

