nutch-dev mailing list archives

From Uroš Gruber
Subject Re: (NUTCH-339) Refactor nutch to allow fetcher improvements
Date Fri, 04 Aug 2006 17:55:11 GMT
Sami Siren wrote:
> Uroš Gruber wrote:
>> Andrzej Bialecki wrote:
>>> Sami Siren (JIRA) wrote:
>>>> I am not sure what you are referring to with this 3-4 sec, but yes, 
>>>> I agree there are more aspects to optimize in the fetcher. What I 
>>>> was first concerned about was the fetching IO speed, which was 
>>>> getting ridiculously low (not quite sure when this happened).
>> I set DEBUG level logging and checked the timing during operations. 
>> When doing the MapReduce job, which is run after every page, it takes 
>> 3-4 seconds until the next URL is fetched.
>> I have a local site, and fetching 100 pages takes about 6 minutes.
> Even I haven't seen it go that slow :)
Lucky me ;)
>>> Depending on the number of map/reduce tasks, there is a framework 
>>> overhead to transfer the job JAR
>> I would like to help find what causes such slowness. Version 0.7 did 
>> not use MapReduce and fetched about 20 pages per second on the same 
>> server. With the same site, fetching is now reduced to 0.3 pages per 
>> second.
> With the queue-based solution I just did a crawl of about 600k pages 
> and it averaged 16 pps (1376 kb/s) with parsing enabled. Perhaps you 
> could try Andrzej's new Fetcher and see how it performs for you (I 
> haven't yet read the code or tested it myself).
I'll try it, but first I need to test it on Java 1.4.2. Maybe the 
problem is with the OS itself. I'll report back as soon as I have more 
test results.


