nutch-dev mailing list archives

From "Enrique Berlanga (JIRA)" <j...@apache.org>
Subject [jira] Closed: (NUTCH-938) Impossible to fetch sites with robots.txt
Date Tue, 30 Nov 2010 09:35:11 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enrique Berlanga closed NUTCH-938.
----------------------------------

    Resolution: Won't Fix

Resolved as "Won't Fix" according to Andrzej Bialecki.

> Impossible to fetch sites with robots.txt
> -----------------------------------------
>
>                 Key: NUTCH-938
>                 URL: https://issues.apache.org/jira/browse/NUTCH-938
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.2
>         Environment: Red Hat, Nutch 1.2, Java 1.6
>            Reporter: Enrique Berlanga
>         Attachments: NUTCH-938.patch
>
>
> Crawling a site with a robots.txt file like this (e.g. http://www.melilla.es):
> -------------------
> User-agent: *
> Disallow: /
> -------------------
> No links are followed. 
> The values set for the "protocol.plugin.check.blocking" and "protocol.plugin.check.robots"
> properties do not matter, because they are overridden in class org.apache.nutch.fetcher.Fetcher:
> // set non-blocking & no-robots mode for HTTP protocol plugins.
>     getConf().setBoolean(Protocol.CHECK_BLOCKING, false);
>     getConf().setBoolean(Protocol.CHECK_ROBOTS, false);
> False is the desired value, but in the FetcherThread inner class, robot rules are checked
> regardless of the configuration:
> ----------------
> RobotRules rules = protocol.getRobotRules(fit.url, fit.datum);
> if (!rules.isAllowed(fit.u)) {
>  ...
> LOG.debug("Denied by robots.txt: " + fit.url);
> ...
> continue;
> }
> -----------------------
> I suppose there is no problem in disabling that part of the code directly for the HTTP protocol.
> If so, I could submit a patch as soon as possible to work around this.
> Thanks in advance
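
For readers following the report, the behaviour described above can be sketched in a few lines of standalone Java. This is only an illustration, not Nutch code: SimpleRobotRules, shouldFetch and the checkRobots flag are hypothetical stand-ins for RobotRules, the FetcherThread check and the "protocol.plugin.check.robots" property, and the prefix matching is a simplified reading of "Disallow" lines. Since the issue was closed as "Won't Fix", the gated variant shows what the reporter asked for, not something Nutch does.

```java
import java.util.Arrays;
import java.util.List;

public class RobotsCheckSketch {

    /** Hypothetical stand-in for Nutch's RobotRules: a list of Disallow path prefixes. */
    static final class SimpleRobotRules {
        private final List<String> disallowedPrefixes;

        SimpleRobotRules(List<String> disallowedPrefixes) {
            this.disallowedPrefixes = disallowedPrefixes;
        }

        /** A path is denied if it starts with any non-empty Disallow prefix. */
        boolean isAllowed(String path) {
            for (String prefix : disallowedPrefixes) {
                if (!prefix.isEmpty() && path.startsWith(prefix)) {
                    return false;
                }
            }
            return true;
        }
    }

    /**
     * Hypothetical gate: consult the rules only when checkRobots is true.
     * The report says FetcherThread performs the check unconditionally,
     * which corresponds to always passing checkRobots = true here.
     */
    static boolean shouldFetch(boolean checkRobots, SimpleRobotRules rules, String path) {
        if (!checkRobots) {
            return true; // rules are ignored entirely
        }
        return rules.isAllowed(path);
    }

    public static void main(String[] args) {
        // "User-agent: *" / "Disallow: /" denies every path, since all paths start with "/".
        SimpleRobotRules rules = new SimpleRobotRules(Arrays.asList("/"));

        System.out.println(shouldFetch(true, rules, "/index.html"));  // prints "false"
        System.out.println(shouldFetch(false, rules, "/index.html")); // prints "true"
    }
}
```

With "Disallow: /" every URL on the site is denied, which is why no links are followed while the check stays active.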

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

