nutch-dev mailing list archives

From Doug Cutting <cutt...@nutch.org>
Subject Re: how to deal with large/slow sites
Date Mon, 12 Sep 2005 17:23:30 GMT
AJ Chen wrote:
> Two questions:
> (1) Is there a better approach to deep-crawl large sites?

If a site has N pages which require T seconds each on average to fetch, 
then fetching the entire site will require N*T seconds.  If that's 
longer than you're willing to wait, then you won't be able to fetch 
the entire site.  If you are willing to wait, then set http.max.delays 
to Integer.MAX_VALUE and wait.  In this case there's no shortcut.
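
For the record, a minimal sketch of what that override might look like 
in conf/nutch-site.xml (the property name is from this thread; 
2147483647 is Integer.MAX_VALUE, and the enclosing root element may 
differ between Nutch versions):

  <property>
    <name>http.max.delays</name>
    <!-- How many times a fetcher thread will delay on a busy host
         before giving up on a page.  2147483647 == Integer.MAX_VALUE,
         i.e. effectively wait forever rather than drop the URL. -->
    <value>2147483647</value>
  </property>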

> (2) Will the dropped urls be picked up again in subsequent cycles of 
> fetchlist/segment/fetch/updatedb?

They will be retried in the next cycle, up to db.fetch.retry.max.
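
Along the same lines, a hedged sketch of capping the retries in 
conf/nutch-site.xml (property name from this thread; the value shown 
here is illustrative):

  <property>
    <name>db.fetch.retry.max</name>
    <!-- Maximum number of fetch cycles in which a failed URL will
         be retried before the db gives up on it. -->
    <value>3</value>
  </property>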

Doug
