nutch-dev mailing list archives

From AJ Chen <>
Subject how to deal with large/slow sites
Date Sun, 11 Sep 2005 20:51:53 GMT
In vertical crawling, there are always some large sites that have tens 
of thousands of pages. Fetching a page from these large sites very often 
returns "retry later" because http.max.delays is exceeded. Setting 
appropriate values for http.max.delays and fetcher.server.delay can 
minimize this kind of URL dropping. However, in my application I 
still see 20-50% of URLs dropped from a few large sites, even with a 
pretty long delay setting: http.max.delays=20, fetcher.server.delay=5.0, 
effectively a 100-second maximum wait per host.
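For reference, here is how I have these set. This is a sketch of the relevant overrides in conf/nutch-site.xml (the property names come from nutch-default.xml; the values are just the ones I'm using, not recommendations). The 100 sec figure comes from the fetcher waiting up to http.max.delays times fetcher.server.delay seconds for a busy host before giving up on the URL:

```xml
<!-- conf/nutch-site.xml: delay settings discussed above.
     A fetcher thread waits up to http.max.delays * fetcher.server.delay
     seconds (here 20 * 5.0 = 100 s) for a host to free up; after that
     the URL is dropped with "retry later". -->
<property>
  <name>http.max.delays</name>
  <value>20</value>
  <!-- how many times a thread will wait for the same busy host -->
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <!-- seconds between successive requests to the same host -->
</property>
```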

Two questions:
(1) Is there a better approach to deep-crawling large sites?  Should we 
treat large sites differently from smaller sites?  I notice Doug and 
Andrzej had discussed potential solutions to this problem.  But does 
anybody have a good short-term solution?

(2) Will the dropped URLs be picked up again in subsequent cycles of 
fetchlist/segment/fetch/updatedb?  If so, running more cycles 
should eventually fetch the dropped URLs.  Does 
db.default.fetch.interval (default is 30 days) influence when the 
dropped URLs will be fetched again?

Appreciate your advice.
