nutch-dev mailing list archives

From Lewis John Mcgibbney
Subject Drawing an analogy between AdaptiveFetchSchedule and AdaptiveCrawlDelay
Date Fri, 02 Mar 2012 11:45:33 GMT
Hi Guys,

Following some comments on the user list, I recently started digging into
HTTP redirects and stumbled across NUTCH-1042. Although redirects and crawl
delays are separate issues, I think they are certainly linked. What is
interesting is that users usually don't consider them to be interlinked, and
therefore struggle to debug how and why either the redirected pages or the
pages subject to a crawl delay are not being fetched.

Doing some more digging, I found the now rather old and tatty NUTCH-475,
which got me thinking about how we maintain the AdaptiveFetchSchedule for
custom refetching. That led me to the following points:

- Regardless of whether we implement an AdaptiveCrawlDelay, NUTCH-1042
still needs to be fixed, as it is obviously becoming a bit of a pain for
some users.
- Can someone shed some light on what happened to that
Dogacan refers to? I was only ever accustomed to OldFetcher and Fetcher :0)
- For those of you managing/running/maintaining your own (and possibly
your clients') web servers, how would you feel about maintaining your own
AdaptiveCrawlDelay? Pros and cons (apart from the obvious)?
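
To make the analogy a bit more concrete, here is a rough sketch of what I
mean. It borrows the idea behind AdaptiveFetchSchedule (grow or shrink an
interval by a configured rate, bounded by a minimum and a maximum) and applies
it to the per-host delay instead of the refetch interval. Everything in it is
an assumption for the sake of discussion: the class name, the percentage
parameters and the "slow response" threshold are hypothetical, and this is not
the patch attached to NUTCH-475 nor any existing Nutch API.

```java
// Hypothetical sketch only: not part of Nutch and not the NUTCH-475 patch.
class AdaptiveCrawlDelaySketch {
    private final long minDelayMs;   // never wait less than this between requests
    private final long maxDelayMs;   // never wait more than this between requests
    private final int incPercent;    // back off by this percentage after a slow response
    private final int decPercent;    // speed up by this percentage after a fast response
    private long delayMs;            // current delay for one host

    AdaptiveCrawlDelaySketch(long initialDelayMs, long minDelayMs, long maxDelayMs,
                             int incPercent, int decPercent) {
        this.minDelayMs = minDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.incPercent = incPercent;
        this.decPercent = decPercent;
        this.delayMs = clamp(initialDelayMs);
    }

    // Adjust the delay after each fetch and return the new value in milliseconds.
    long onResponse(long responseTimeMs, long slowThresholdMs) {
        if (responseTimeMs > slowThresholdMs) {
            delayMs += delayMs * incPercent / 100;  // server looks busy: be more polite
        } else {
            delayMs -= delayMs * decPercent / 100;  // server looks healthy: tighten up
        }
        delayMs = clamp(delayMs);
        return delayMs;
    }

    long currentDelayMs() {
        return delayMs;
    }

    private long clamp(long value) {
        return Math.max(minDelayMs, Math.min(maxDelayMs, value));
    }

    public static void main(String[] args) {
        AdaptiveCrawlDelaySketch delay =
            new AdaptiveCrawlDelaySketch(1000, 500, 5000, 40, 20);
        System.out.println(delay.onResponse(3000, 2000)); // slow response
        System.out.println(delay.onResponse(100, 2000));  // fast response
    }
}
```

Integer percentages are used rather than floating-point rates purely to keep
the arithmetic easy to follow; a real implementation would presumably read its
bounds and rates from configuration and keep one such object per host.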

I can't really think of anything else at the moment!



