nutch-dev mailing list archives

From Uroš Gruber <uros.gru...@sir-mag.com>
Subject Re: (NUTCH-339) Refactor nutch to allow fetcher improvements
Date Fri, 04 Aug 2006 17:55:11 GMT
Sami Siren wrote:
> Uroš Gruber wrote:
>
>> Andrzej Bialecki wrote:
>>
>>> Sami Siren (JIRA) wrote:
>>>
>>>> I am not sure what you are referring to with this 3-4 sec, but yes, I 
>>>> agree there are more aspects to optimize in the fetcher. What I was 
>>>> first concerned about was the fetching IO speed, which was getting 
>>>> ridiculously low (not quite sure when this happened).
>>>>   
>>>
>>>
>> I set DEBUG level logging and checked the timing during operations. 
>> The MapReduce job, which is run after every page, takes 3-4 seconds 
>> until the next URL is fetched.
>> I have a local site, and fetching 100 pages takes about 6 minutes.
>
> Even I haven't seen it go that slow :)
>
Lucky me ;)
>>> Depending on the number of map/reduce tasks, there is a framework 
>>> overhead to transfer the job JAR
>>>
>> I would like to help find what causes such slowness. Version 0.7 did 
>> not use MapReduce, and fetching ran at about 20 pages per second on 
>> the same server. With the same site, fetching is now down to 0.3 pages 
>> per second.
>
> With the queue-based solution I just did a crawl of about 600k pages and 
> it averaged 16 pps (1376 kb/s) with parsing enabled. Perhaps you could 
> try Andrzej's new Fetcher and see how it performs for you (I haven't 
> yet read the code or tested it myself).
>
I'll try it, but first I need to test it on Java 1.4.2. Maybe the 
problem is with the OS itself. I'll report back as soon as I have more 
test results.

regards

Uros
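
For reference, the figures quoted in this thread are consistent with each other. The sketch below only redoes the arithmetic from the numbers given above (100 pages in about 6 minutes, a 3-4 second gap per URL, 20 pages per second on 0.7). The function and variable names are illustrative and are not part of Nutch.

```python
def pages_per_second(pages: int, seconds: float) -> float:
    """Average fetch throughput over a run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return pages / seconds


# 100 pages in about 6 minutes, as reported for the local test site.
observed = pages_per_second(100, 6 * 60)

# A 3-4 second gap between fetched URLs bounds the rate on its own.
rate_at_4s_gap = 1 / 4.0
rate_at_3s_gap = 1 / 3.0

# Slowdown relative to the 20 pages per second reported for version 0.7,
# using the rounded 0.3 pages per second figure from the message.
slowdown = 20 / 0.3

print(f"observed: {observed:.2f} pages/s")
print(f"implied by 3-4 s gap: {rate_at_4s_gap:.2f} to {rate_at_3s_gap:.2f} pages/s")
print(f"slowdown vs 0.7: about {slowdown:.0f}x")
```

The observed rate falls inside the range implied by the per-URL gap, which suggests the delay between fetches, rather than download time, accounts for most of the run.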
