nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: (NUTCH-339) Refactor nutch to allow fetcher improvements
Date Fri, 04 Aug 2006 16:44:15 GMT
Sami Siren (JIRA) wrote:
> I am not sure what you refer to by this 3-4 sec, but yes, I agree there are more aspects
> to optimize in fetcher. What I was firstly concerned about was the fetching IO speed, which was getting
> ridiculously low (not quite sure when this happened).

Depending on the number of map/reduce tasks, there is a framework 
overhead to transfer the job JAR file, and start the subprocess on each 
tasktracker. However, once these are started the framework's overhead 
should be negligible, because a single task is responsible for fetching 
many URLs.

Naturally, for small jobs with very few URLs, the overhead is 
relatively large.
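
To put rough numbers on that, here is a minimal sketch. The startup cost and per-URL time are assumed values chosen for illustration, not measurements from Nutch or Hadoop, and TaskOverhead is a hypothetical helper, not part of either code base.

```java
// Illustrative only: all numbers below are assumptions, not measurements.
final class TaskOverhead {

    /** Fraction of a task's wall-clock time spent on the fixed startup cost. */
    static double overheadFraction(double startupSeconds, double secondsPerUrl, long urlCount) {
        double fetchSeconds = secondsPerUrl * urlCount;
        return startupSeconds / (startupSeconds + fetchSeconds);
    }

    public static void main(String[] args) {
        double startup = 4.0; // assumed cost of shipping the job JAR and starting the subprocess
        double perUrl = 0.5;  // assumed average time per URL inside one task
        for (long urls : new long[] {8, 800, 80000}) {
            double percent = 100.0 * overheadFraction(startup, perUrl, urls);
            System.out.printf("%d URLs: %.2f%% of task time is startup overhead%n", urls, percent);
        }
    }
}
```

With these assumed values the fixed cost dominates a tiny fetchlist and all but disappears for a large one, which is the whole argument in two lines of arithmetic.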

The symptom I'm seeing is that eventually most threads end up 
spin-waiting in blockAddr. Another problem I see is that when the number of 
fetching threads is high relative to the available bandwidth, the data 
trickles in so slowly that the fetcher decides that it's hung 
and aborts the task. The task then gets a SUCCEEDED 
status in the tasktracker, although in reality it may have fetched only a 
small portion of the allotted fetchlist.
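
The second problem can be sketched as follows. This is a simplified simulation and not the actual Nutch code: HungCheckSketch, its Result type, and the timeout value are all hypothetical, and the "status" is only a stand-in for how a caller might treat a normal return as success.

```java
// Simplified simulation; names and numbers are hypothetical, not Nutch or Hadoop APIs.
final class HungCheckSketch {

    /** Outcome of a simulated run. */
    static final class Result {
        final int fetched;
        final boolean aborted;
        Result(int fetched, boolean aborted) {
            this.fetched = fetched;
            this.aborted = aborted;
        }
    }

    /**
     * Walks a fetchlist where responseMs[i] is how long response i takes to arrive.
     * A watchdog gives up as soon as one response takes longer than timeoutMs,
     * even if data is still slowly arriving.
     */
    static Result run(long[] responseMs, long timeoutMs) {
        int fetched = 0;
        for (long ms : responseMs) {
            if (ms > timeoutMs) {
                // Slow trickle is mistaken for a hang: stop, but return normally.
                return new Result(fetched, true);
            }
            fetched++;
        }
        return new Result(fetched, false);
    }

    /** Stand-in for a caller that equates "returned without exception" with success. */
    static String runAsTask(long[] responseMs, long timeoutMs) {
        try {
            run(responseMs, timeoutMs);
            return "SUCCEEDED"; // reported regardless of how much was fetched
        } catch (RuntimeException e) {
            return "FAILED";
        }
    }

    public static void main(String[] args) {
        long[] responses = {100, 200, 5000, 100}; // third response trickles in slowly
        Result r = run(responses, 1000);
        System.out.println("fetched " + r.fetched + " of " + responses.length
                + ", aborted=" + r.aborted + ", status=" + runAsTask(responses, 1000));
    }
}
```

In this sketch the abort path returns normally, so the partial fetch is indistinguishable from a complete one at the status level.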

> We should open more than one ticket to track these separate aspects. And for general
> discussion the mailing lists are perhaps the best place.
(I'm moving this to the list then).

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
Contact: info at sigram dot com
