nutch-dev mailing list archives

From "chee wu " <chee...@gmail.com>
Subject Re: Fetcher2
Date Mon, 22 Jan 2007 15:02:42 GMT
Fetcher2 should be a great help for me, but it seems it can't be integrated with Nutch 0.8.1.
Any advice on how to use it with 0.8.1?
----- Original Message ----- 
From: "Andrzej Bialecki" <ab@getopt.org>
To: <nutch-dev@lucene.apache.org>
Sent: Thursday, January 18, 2007 5:18 AM
Subject: Fetcher2


> Hi all,
> 
> I just committed a new implementation of the venerable fetcher, called 
> Fetcher2. It uses a producer/consumers model with a set of per-host 
> queues. Theoretically it should be able to achieve a much higher 
> throughput, especially for fetchlists with a lot of contention (many 
> urls from the same hosts).
> 
> It should be possible to achieve the same fetching rate with a smaller 
> number of threads, and most importantly to avoid the dreaded "Exceeded 
> http.max.delays: retry later" error.
> 
> It is available through "bin/nutch fetch2".
> 
> From the javadoc:
> 
> "A queue-based fetcher.
> 
> This fetcher uses a well-known model of one producer (a QueueFeeder) and 
> many consumers (FetcherThread-s).
> 
> QueueFeeder reads input fetchlists and populates a set of 
> FetchItemQueue-s, which hold FetchItem-s that describe the items to be 
> fetched. There are as many queues as there are unique hosts, but at any 
> given time the total number of fetch items in all queues is less than a 
> fixed number (currently set to a multiple of the number of threads).
> 
> As items are consumed from the queues, the QueueFeeder continues to add 
> new input items, so that their total count stays fixed (FetcherThread-s 
> may also add new items to the queues e.g. as a result of redirection) - 
> until all input items are exhausted, at which point the number of items 
> in the queues begins to decrease. When this number reaches 0, the fetcher 
> will finish.
> 
> This fetcher implementation handles per-host blocking itself, instead of 
> delegating this work to protocol-specific plugins. Each per-host queue 
> handles its own "politeness" settings, such as the maximum number of 
> concurrent requests and crawl delay between consecutive requests - and 
> also a list of requests in progress, and the time the last request was 
> finished. As FetcherThread-s ask for new items to be fetched, queues may 
> return eligible items or null if for "politeness" reasons this host's 
> queue is not yet ready.
> 
> If there are still unfetched items on the queues, but none of the items 
> are ready, FetcherThread-s will spin-wait until either some items become 
> available, or a timeout is reached (at which point the Fetcher will 
> abort, assuming the task is hung)."
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
>
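
The per-host "politeness" behaviour described in the quoted javadoc can be illustrated with a minimal sketch. This is not the Fetcher2 code: the class name HostQueueSketch, its methods, and the explicit nowMs clock parameter are all hypothetical and chosen only for illustration. It assumes each queue tracks the number of requests in progress and the time the last request finished, and returns null when the host is not yet ready.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical per-host queue; illustrative only, not Nutch's actual FetchItemQueue. */
class HostQueueSketch {
    private final Deque<String> pending = new ArrayDeque<>();
    private final int maxInProgress;  // maximum concurrent requests to this host
    private final long crawlDelayMs;  // minimum gap after the last request finished
    private int inProgress = 0;
    private long lastFinishedMs = Long.MIN_VALUE; // no request has finished yet

    HostQueueSketch(int maxInProgress, long crawlDelayMs) {
        this.maxInProgress = maxInProgress;
        this.crawlDelayMs = crawlDelayMs;
    }

    /** Producer side: append a URL for this host. */
    synchronized void add(String url) {
        pending.addLast(url);
    }

    /**
     * Consumer side: returns the next URL, or null if the queue is empty
     * or the host is not ready for "politeness" reasons.
     */
    synchronized String poll(long nowMs) {
        if (pending.isEmpty()) return null;
        if (inProgress >= maxInProgress) return null;           // too many in flight
        if (nowMs < lastFinishedMs + crawlDelayMs) return null; // crawl delay not elapsed
        inProgress++;
        return pending.pollFirst();
    }

    /** Called by a consumer when its request completes. */
    synchronized void finish(long nowMs) {
        inProgress--;
        lastFinishedMs = nowMs;
    }

    synchronized int size() {
        return pending.size();
    }
}
```

In this sketch a consumer thread would call poll with the current time, fetch the returned URL if it is non-null, and then call finish. Passing the clock in explicitly is a design choice made here only so the behaviour is easy to check by hand.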