nutch-dev mailing list archives

From Andrzej Bialecki
Subject Fetcher2
Date Wed, 17 Jan 2007 21:18:15 GMT
Hi all,

I just committed a new implementation of the venerable fetcher, called 
Fetcher2. It uses a producer/consumers model with a set of per-host 
queues. Theoretically it should be able to achieve a much higher 
throughput, especially for fetchlists with a lot of contention (many 
urls from the same hosts).

It should be possible to achieve the same fetching rate with a smaller 
number of threads, and most importantly to avoid the dreaded "Exceeded 
http.max.delays: retry later" error.

It is available through "bin/nutch fetch2".

From the javadoc:

"A queue-based fetcher.

This fetcher uses a well-known model of one producer (a QueueFeeder) and 
many consumers (FetcherThread-s).

QueueFeeder reads input fetchlists and populates a set of 
FetchItemQueue-s, which hold FetchItem-s that describe the items to be 
fetched. There are as many queues as there are unique hosts, but at any 
given time the total number of fetch items in all queues is less than a 
fixed number (currently set to a multiple of the number of threads).

As items are consumed from the queues, the QueueFeeder continues to add 
new input items, so that their total count stays fixed (FetcherThread-s 
may also add new items to the queues, e.g. as a result of redirection) - 
until all input items are exhausted, at which point the number of items 
in the queues begins to decrease. When this number reaches 0, the fetcher 
finishes.

This fetcher implementation handles per-host blocking itself, instead of 
delegating this work to protocol-specific plugins. Each per-host queue 
handles its own "politeness" settings, such as the maximum number of 
concurrent requests and crawl delay between consecutive requests - and 
also a list of requests in progress, and the time the last request was 
finished. As FetcherThread-s ask for new items to be fetched, queues may 
return eligible items or null if for "politeness" reasons this host's 
queue is not yet ready.

If there are still unfetched items on the queues, but none of the items 
are ready, FetcherThread-s will spin-wait until either some items become 
available, or a timeout is reached (at which point the Fetcher will 
abort, assuming the task is hung)."
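
To make the per-host blocking described in the javadoc more concrete, here 
is a minimal sketch of the idea. It is not the Fetcher2 code: the class 
PoliteQueues, its methods, and the single shared crawl delay and 
concurrency limit are assumptions made for illustration only, and the 
caller passes in the current time to keep the example deterministic.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of per-host "politeness" queues.
// All names here are illustrative; they are not the actual Fetcher2 classes.
class PoliteQueues {

    // One queue per host, holding that host's politeness state.
    static final class HostQueue {
        private final Deque<String> pending = new ArrayDeque<>();
        private int inFlight = 0;          // requests currently in progress
        private long lastFinishedMs = -1;  // -1: no request has finished yet
    }

    // LinkedHashMap keeps hosts in insertion order, so scans are predictable.
    private final Map<String, HostQueue> queues = new LinkedHashMap<>();
    private final long crawlDelayMs;       // assumed: same delay for every host
    private final int maxInFlightPerHost;  // assumed: same limit for every host

    PoliteQueues(long crawlDelayMs, int maxInFlightPerHost) {
        this.crawlDelayMs = crawlDelayMs;
        this.maxInFlightPerHost = maxInFlightPerHost;
    }

    private static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    // Producer side: file the URL under its host's queue.
    synchronized void add(String url) {
        queues.computeIfAbsent(hostOf(url), h -> new HostQueue())
              .pending.addLast(url);
    }

    // Consumer side: return an eligible URL, or null if every non-empty
    // queue is currently blocked for politeness reasons.
    synchronized String next(long nowMs) {
        for (HostQueue q : queues.values()) {
            if (q.pending.isEmpty()) continue;
            if (q.inFlight >= maxInFlightPerHost) continue;
            boolean delayPending = q.lastFinishedMs >= 0
                    && nowMs - q.lastFinishedMs < crawlDelayMs;
            if (delayPending) continue;
            q.inFlight++;
            return q.pending.pollFirst();
        }
        return null;
    }

    // Called when a request completes; starts that host's crawl-delay clock.
    synchronized void finished(String url, long nowMs) {
        HostQueue q = queues.get(hostOf(url));
        if (q == null || q.inFlight == 0) return;
        q.inFlight--;
        q.lastFinishedMs = nowMs;
    }

    // Total number of items still waiting in all queues.
    synchronized int totalPending() {
        int n = 0;
        for (HostQueue q : queues.values()) n += q.pending.size();
        return n;
    }
}
```

In this sketch a consumer thread would call next() in a loop, sleep 
briefly whenever it gets null while totalPending() is still above zero, 
and give up after a timeout, which roughly mirrors the spin-wait 
behaviour the javadoc describes.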

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
