nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2055) Random Crawl Delay
Date Wed, 01 Jul 2015 15:42:04 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610487#comment-14610487 ]

Sebastian Nagel commented on NUTCH-2055:
----------------------------------------

Hi Talat,
* do you really want a random but fixed crawl delay for every host? Isn't it about randomizing
the intervals between accessing the same host? For the latter case nextFetchTime in FetchItemQueue
needs to be set to a random value after each fetch/access, probably from FetchItemQueue.setEndTime().
* shouldn't the random delay be chosen between "fetcher.server.delay" and "fetcher.max.crawl.delay"?
Just to guarantee a certain minimum delay. In case multiple FetcherThreads are accessing the
same host ("fetcher.threads.per.queue" > 1), the minimum is consequently "fetcher.server.min.delay".
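To make the first suggestion concrete, here is a minimal sketch of drawing a fresh random delay after every fetch, bounded below by a minimum delay and above by a maximum. RandomDelayQueue is a hypothetical stand-in written for illustration, not Nutch's actual FetchItemQueue; only the names setEndTime() and nextFetchTime are borrowed from the comment above. The sketch takes plain millisecond values and leaves reading "fetcher.server.delay" and "fetcher.max.crawl.delay" from the configuration to the caller.

```java
import java.util.Random;

/**
 * Hypothetical stand-in for a per-host fetch queue (NOT Nutch's FetchItemQueue).
 * Shows re-randomizing the politeness delay after each fetch.
 */
class RandomDelayQueue {
    private final long minDelayMs; // assumed lower bound, e.g. derived from fetcher.server.delay
    private final long maxDelayMs; // assumed upper bound, e.g. derived from fetcher.max.crawl.delay
    private final Random random;
    private long nextFetchTime;

    RandomDelayQueue(long minDelayMs, long maxDelayMs, Random random) {
        if (minDelayMs < 0 || maxDelayMs < minDelayMs) {
            throw new IllegalArgumentException("need 0 <= minDelayMs <= maxDelayMs");
        }
        this.minDelayMs = minDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.random = random;
    }

    /** Called when a fetch finishes: pick a new random delay for the next access. */
    void setEndTime(long endTimeMs) {
        long span = maxDelayMs - minDelayMs;
        // nextDouble() is in [0, 1), so the offset falls in 0..span inclusive.
        long offset = (long) (random.nextDouble() * (span + 1));
        long delay = Math.min(minDelayMs + offset, maxDelayMs);
        nextFetchTime = endTimeMs + delay;
    }

    long getNextFetchTime() {
        return nextFetchTime;
    }

    public static void main(String[] args) {
        RandomDelayQueue queue = new RandomDelayQueue(5_000L, 30_000L, new Random());
        long now = System.currentTimeMillis();
        for (int i = 0; i < 3; i++) {
            queue.setEndTime(now);
            System.out.println("next delay (ms): " + (queue.getNextFetchTime() - now));
        }
    }
}
```

Because the delay is redrawn on every call to setEndTime(), the interval between two accesses to the same host varies from fetch to fetch, while the lower bound still guarantees a minimum pause.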


> Random Crawl Delay
> ------------------
>
>                 Key: NUTCH-2055
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2055
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 2.3
>            Reporter: Talat UYARER
>            Priority: Trivial
>             Fix For: 2.4
>
>         Attachments: NUTCH-2055.patch
>
>
> Some firewalls can block requests that arrive with the same delay every time. I created
> a patch for a random crawl delay between 0 and the max crawl delay setting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
