nutch-dev mailing list archives

From "ASF GitHub Bot (Jira)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2754) fetcher.max.crawl.delay ignored if exceeding 5 min. / 300 sec.
Date Mon, 23 Dec 2019 10:58:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002194#comment-17002194 ]

ASF GitHub Bot commented on NUTCH-2754:
---------------------------------------

sebastian-nagel commented on pull request #487: NUTCH-2754 fetcher.max.crawl.delay ignored
if exceeding 5 min. / 300 sec.
URL: https://github.com/apache/nutch/pull/487

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> fetcher.max.crawl.delay ignored if exceeding 5 min. / 300 sec.
> --------------------------------------------------------------
>
>                 Key: NUTCH-2754
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2754
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, robots
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.17
>
>
> Sites whose robots.txt specifies a Crawl-Delay of more than 5 minutes (301 seconds or more)
> are always skipped, even if fetcher.max.crawl.delay is set to a higher value.
> We need to pass the configured value of fetcher.max.crawl.delay to [crawler-commons' robots.txt parser|https://github.com/crawler-commons/crawler-commons/blob/c9c0ac6eda91b13d534e69f6da3fd15065414fb0/src/main/java/crawlercommons/robots/SimpleRobotRulesParser.java#L78];
> otherwise the parser uses its internal default of 300 sec. and disallows all sites specifying
> a longer Crawl-Delay in their robots.txt.
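
To make the described behavior concrete, here is a minimal, self-contained sketch. It is not the Nutch or crawler-commons code: the class CrawlDelaySketch and the method disallowsSite are hypothetical stand-ins, and the only facts taken from the issue are the 300 sec. internal default and the rule that a longer Crawl-Delay causes the site to be disallowed. The example values (a Crawl-Delay of 600 sec. and fetcher.max.crawl.delay of 900 sec.) are assumptions chosen for illustration.

```java
import java.util.concurrent.TimeUnit;

public class CrawlDelaySketch {

    /** Internal default limit named in the issue: 300 seconds, held here in milliseconds. */
    static final long DEFAULT_MAX_CRAWL_DELAY_MS = TimeUnit.SECONDS.toMillis(300);

    /**
     * Hypothetical stand-in for the parser's decision: a Crawl-Delay above
     * the limit causes the whole site to be treated as disallowed.
     */
    static boolean disallowsSite(long crawlDelaySec, long maxCrawlDelayMs) {
        return TimeUnit.SECONDS.toMillis(crawlDelaySec) > maxCrawlDelayMs;
    }

    public static void main(String[] args) {
        long crawlDelaySec = 600;           // assumed Crawl-Delay from a site's robots.txt
        long fetcherMaxCrawlDelaySec = 900; // assumed fetcher.max.crawl.delay setting

        // Reported behavior: the configured limit never reaches the parser,
        // so the comparison is made against the 300 sec. default.
        boolean limitNotPassed = disallowsSite(crawlDelaySec, DEFAULT_MAX_CRAWL_DELAY_MS);

        // Proposed behavior: the configured limit is handed to the parser.
        boolean limitPassed = disallowsSite(
                crawlDelaySec, TimeUnit.SECONDS.toMillis(fetcherMaxCrawlDelaySec));

        System.out.println("limit not passed: disallowed=" + limitNotPassed);
        System.out.println("limit passed:     disallowed=" + limitPassed);
    }
}
```

With these assumed values the site is disallowed in the first case and allowed in the second, which is the difference the issue asks for.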



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
