nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Retire the original Fetcher before the release?
Date Mon, 17 Mar 2008 14:20:11 GMT
Dennis Kubes wrote:
> We continue to run on Fetcher1.

Since you're running large crawls, could you run one of them with 
Fetcher2 and comment on the results? Note that Fetcher2 needs far 
fewer threads than the old Fetcher: fewer than 100 threads is usually 
more than sufficient for a large crawl.

> What are the benefits of moving to 
> Fetcher2? Not opposed to it, just hadn't thought about it yet as 
> Fetcher1 seemed to be working fine for us.

Politeness is implemented and enforced in Fetcher2 itself instead of in 
the protocol plugins. This means that the same blocking code can be 
reused for any protocol (ftp, file, etc.).

Fetcher2 also handles the "long tail" problem better: the old Fetcher 
would make threads spin-wait until a host became available, whereas 
Fetcher2 reuses those threads to handle work items from other host 
queues.

Finally, Fetcher2 follows a cleaner producer/consumer model with 
per-host queues, which makes it more suitable for extensions. For 
example, one extension that I implemented in private code monitors 
each host queue for error rates, error types, download speed, etc., 
and adjusts the fetching parameters accordingly. Implementing this in 
the old Fetcher would be a nightmare.
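
To make the per-host queue idea concrete, here is a minimal sketch in Java. It is not the actual Fetcher2 code: the class name HostQueues, its methods, and the fixed crawlDelayMs parameter are all made up for illustration, and the clock is passed in as an argument so the behaviour is easy to follow. A real fetcher would presumably also track in-progress requests and start the delay when a fetch finishes rather than when it is handed out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of per-host queues with politeness; not Nutch code. */
class HostQueues {
    /** URLs waiting for one host, plus the earliest time it may be hit again. */
    private static final class HostQueue {
        final Deque<String> urls = new ArrayDeque<>();
        long nextFetchTime = 0L;
    }

    private final Map<String, HostQueue> queues = new LinkedHashMap<>();
    private final long crawlDelayMs; // assumed: one fixed delay for every host

    HostQueues(long crawlDelayMs) {
        this.crawlDelayMs = crawlDelayMs;
    }

    /** Producer side: file a URL under its host, whatever the protocol. */
    void add(String host, String url) {
        queues.computeIfAbsent(host, h -> new HostQueue()).urls.add(url);
    }

    /**
     * Consumer side: return a URL from the first host that is ready at
     * time 'now', or null if every host is empty or still in its delay.
     * A worker never waits on one particular host; it takes work from
     * whichever queue is available.
     */
    String poll(long now) {
        for (HostQueue q : queues.values()) {
            if (!q.urls.isEmpty() && now >= q.nextFetchTime) {
                q.nextFetchTime = now + crawlDelayMs;
                return q.urls.poll();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        HostQueues q = new HostQueues(1000);
        q.add("a.example", "http://a.example/1");
        q.add("a.example", "http://a.example/2");
        q.add("b.example", "ftp://b.example/1");
        System.out.println(q.poll(0));    // first URL for a.example
        System.out.println(q.poll(0));    // a.example is delayed, so b.example is served
        System.out.println(q.poll(1000)); // a.example is ready again
    }
}
```

Because the delay lives in the queue rather than in a protocol plugin, the same check applies to http and ftp URLs alike, and per-host monitoring could be hung off the HostQueue object in one place.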


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

