nutch-dev mailing list archives

From "Enrique Berlanga (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-938) Impossible to fetch sites with robots.txt
Date Thu, 25 Nov 2010 11:46:13 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935728#action_12935728 ]

Enrique Berlanga commented on NUTCH-938:
----------------------------------------

Thanks for your answer. I agree with you that Nutch as a project cannot encourage such a practice,
but maybe some code in the Protocol or Fetcher class needs to be removed from the official source.
If not, it's hard to understand why these lines appear in the main method of the class ...
--------
// set non-blocking & no-robots mode for HTTP protocol plugins.
getConf().setBoolean(Protocol.CHECK_BLOCKING, false);
getConf().setBoolean(Protocol.CHECK_ROBOTS, false);
--------
... and later, in the fetcher thread, those values are ignored.
Maybe a note in crawl-urlfilter.txt marking these properties as deprecated would be helpful.
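
If those two lines were simply removed, the effective values would come straight from the configuration.
A rough sketch of what I mean (not a tested patch; the defaults of true here are my assumption, matching polite crawling):
--------
// Sketch only: read the configured values instead of forcing them to false.
// Defaulting to true keeps the robots.txt and blocking checks enabled unless
// someone explicitly disables them in their configuration.
boolean checkBlocking = getConf().getBoolean(Protocol.CHECK_BLOCKING, true);
boolean checkRobots = getConf().getBoolean(Protocol.CHECK_ROBOTS, true);
--------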

My question is: is there any reason to force it to false? A well-behaved crawler that obeys
robots rules and netiquette would want it set to true, which leaves me a little confused about
that part of the code. I would prefer to be free to change the behaviour by changing the
"protocol.plugin.check.robots" value in the crawl-urlfilter.txt file.
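
To make it concrete, this is roughly the behaviour I have in mind in the fetcher thread (a sketch only, not a
tested patch; I am assuming the thread can read the job Configuration as "conf" and that true stays the default):
--------
// Sketch: apply the robots rules only when the configured flag asks for it.
if (conf.getBoolean(Protocol.CHECK_ROBOTS, true)) {
  RobotRules rules = protocol.getRobotRules(fit.url, fit.datum);
  if (!rules.isAllowed(fit.u)) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Denied by robots.txt: " + fit.url);
    }
    // ... existing handling for robots-denied URLs stays unchanged ...
    continue;
  }
}
--------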
Thanks in advance

> Impossible to fetch sites with robots.txt
> -----------------------------------------
>
>                 Key: NUTCH-938
>                 URL: https://issues.apache.org/jira/browse/NUTCH-938
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.2
>         Environment: Red Hat, Nutch 1.2, Java 1.6
>            Reporter: Enrique Berlanga
>         Attachments: NUTCH-938.patch
>
>
> When crawling a site whose robots.txt file looks like this (e.g. http://www.melilla.es):
> -------------------
> User-agent: *
> Disallow: /
> -------------------
> no links are followed at all.
> It doesn't matter which value is set for the "protocol.plugin.check.blocking" or "protocol.plugin.check.robots"
> properties, because they are overridden in the class org.apache.nutch.fetcher.Fetcher:
> // set non-blocking & no-robots mode for HTTP protocol plugins.
>     getConf().setBoolean(Protocol.CHECK_BLOCKING, false);
>     getConf().setBoolean(Protocol.CHECK_ROBOTS, false);
> False is the desired value, but in the FetcherThread inner class the robot rules are checked anyway,
> ignoring that configuration:
> ----------------
> RobotRules rules = protocol.getRobotRules(fit.url, fit.datum);
> if (!rules.isAllowed(fit.u)) {
>  ...
> LOG.debug("Denied by robots.txt: " + fit.url);
> ...
> continue;
> }
> -----------------------
> I suppose there is no problem in disabling that part of the code directly for the HTTP protocol.
> If so, I could submit a patch as soon as possible to get past this.
> Thanks in advance

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

