nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
Date Sat, 12 Jan 2013 16:48:12 GMT


Lewis John McGibbney commented on NUTCH-1284:

Hi Tejas. Nice catch btw as it looks like you've integrated NUTCH-1042 into this patch as
With regard to the original issue here, i.e. NUTCH-1284, it would be excellent if this issue
could also provide logging for the fetcher as originally stated in the issue description,
i.e. the log output records crawl.delay on a per-URL basis. I like the debug logging you've
added for the queue. Although it is not marked, IIRC this issue affects both 1.x and 2.x...

> Add site fetcher.max.crawl.delay as log output by default.
> ----------------------------------------------------------
>                 Key: NUTCH-1284
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>            Priority: Trivial
>             Fix For: 1.7, 2.2
>         Attachments: NUTCH-1284.patch
> Currently, when manually scanning our log output we cannot infer which pages are governed
> by a crawl delay between successive fetch attempts of any given page within the site. The
> value should be made available as something like:
> {code}
> 2012-02-19 12:33:33,031 INFO  fetcher.Fetcher - fetching (crawl.delay=XXXms)
> {code}
> This way we can easily and quickly determine whether the fetcher is having to use this
> functionality or not.
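
For illustration only, here is a minimal sketch of how a log line in the shape
requested above could be built. It is not taken from NUTCH-1284.patch: the class
name CrawlDelayLogLine, the format() helper, and the example URL are all
hypothetical, and it uses java.util.logging only so that it runs standalone
(the timestamp and prefix it prints will therefore differ from the requested
output). It also assumes the fetched URL appears between "fetching" and the
crawl delay.

```java
import java.util.logging.Logger;

// Hypothetical helper for illustration; not part of Nutch or the attached patch.
final class CrawlDelayLogLine {

    // The logger name mirrors the "fetcher.Fetcher" prefix in the requested output.
    private static final Logger LOG = Logger.getLogger("fetcher.Fetcher");

    // Builds the message body: the URL followed by the crawl delay in milliseconds.
    static String format(String url, long crawlDelayMs) {
        return "fetching " + url + " (crawl.delay=" + crawlDelayMs + "ms)";
    }

    public static void main(String[] args) {
        // Example values only; a real fetcher would pass the per-URL delay it applies.
        LOG.info(format("http://example.com/page.html", 5000L));
    }
}
```

Keeping the delay in the same INFO line as the URL would let a simple search for
"crawl.delay=" show which fetches were governed by a delay.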

