nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-938) Impossible to fetch sites with robots.txt
Date Thu, 25 Nov 2010 12:50:14 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935745#action_12935745 ]

Andrzej Bialecki commented on NUTCH-938:
-----------------------------------------

These two properties are documented in nutch-default.xml, but they are mostly for internal
use by Nutch. Other implementations of Fetcher (the OldFetcher) used to delegate the robot
and politeness controls to protocol plugins. The current implementation of Fetcher performs
these tasks itself, although in 1.2 protocol plugins still retain the code to implement these
controls per protocol. In 1.3 (unreleased) and trunk this support has been removed from protocol
plugins, so these lines will have no effect.
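
For context, this is roughly the pattern a 1.2-era protocol plugin follows when it consumes the two flags; a simplified sketch rather than the actual lib-http source (the field names, the fetch-path framing, and the denied-fetch handling here are illustrative):

----------------
// Simplified illustration only -- not verbatim Nutch 1.2 source.
// Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS are the keys for the
// two properties discussed above.
private boolean checkRobots;
private boolean checkBlocking;

public void setConf(Configuration conf) {
  this.checkRobots = conf.getBoolean(Protocol.CHECK_ROBOTS, true);
  this.checkBlocking = conf.getBoolean(Protocol.CHECK_BLOCKING, true);
}

// Inside the plugin's fetch path (u is the java.net.URL being fetched):
if (checkRobots) {
  RobotRules rules = getRobotRules(url, datum);
  if (!rules.isAllowed(u)) {
    // the plugin itself refuses the fetch with a robots-denied status
  }
}
if (checkBlocking) {
  // the plugin enforces the per-host politeness delay before fetching
}
----------------

Because the current Fetcher forces both flags to false (see the snippet quoted in the issue below), this plugin-side path is effectively dead code in 1.2.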

> Impossible to fetch sites with robots.txt
> -----------------------------------------
>
>                 Key: NUTCH-938
>                 URL: https://issues.apache.org/jira/browse/NUTCH-938
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.2
>         Environment: Red Hat, Nutch 1.2, Java 1.6
>            Reporter: Enrique Berlanga
>         Attachments: NUTCH-938.patch
>
>
> Crawling a site with a robots.txt file like this (e.g. http://www.melilla.es):
> -------------------
> User-agent: *
> Disallow: /
> -------------------
> No links are followed. 
> It doesn't matter what value is set for the "protocol.plugin.check.blocking" or "protocol.plugin.check.robots"
properties, because they are overridden in the class org.apache.nutch.fetcher.Fetcher:
> // set non-blocking & no-robots mode for HTTP protocol plugins.
> getConf().setBoolean(Protocol.CHECK_BLOCKING, false);
> getConf().setBoolean(Protocol.CHECK_ROBOTS, false);
> False is the desired value, but in the FetcherThread inner class the robot rules are checked
regardless of the configuration:
> ----------------
> RobotRules rules = protocol.getRobotRules(fit.url, fit.datum);
> if (!rules.isAllowed(fit.u)) {
>   ...
>   LOG.debug("Denied by robots.txt: " + fit.url);
>   ...
>   continue;
> }
> -----------------------
> I suppose there is no problem in disabling that part of the code directly for the HTTP protocol.
If so, I could submit a patch as soon as possible to get past this (see the sketch below the quoted report).
> Thanks in advance
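
For reference, the patch proposed above would presumably amount to guarding the quoted FetcherThread block with a configuration flag rather than deleting it outright. A minimal sketch, assuming a new property -- "fetcher.robots.ignore" is a made-up name for illustration, not an existing Nutch setting -- and keeping the elided lines exactly as in the snippet above:

----------------
// Hypothetical sketch of such a guard; this is not the attached NUTCH-938.patch.
// "fetcher.robots.ignore" is an assumed property name.
if (!getConf().getBoolean("fetcher.robots.ignore", false)) {
  RobotRules rules = protocol.getRobotRules(fit.url, fit.datum);
  if (!rules.isAllowed(fit.u)) {
    ...
    LOG.debug("Denied by robots.txt: " + fit.url);
    ...
    continue;
  }
}
----------------

Gating the check on an explicit opt-out keeps the default behavior (honoring robots.txt) intact, which matters because crawling sites that disallow robots is generally considered bad crawling etiquette.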

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

