nutch-dev mailing list archives

From AJ <cano...@gmail.com>
Subject fetch performance
Date Fri, 09 Sep 2005 17:48:44 GMT
I tried to run 10 cycles of fetch/updatedb.  In the 3rd cycle, the fetch
list had 8810 urls.  Fetch ran pretty fast on my laptop until about 4000
pages were fetched. After 4000 pages, it suddenly slowed down to a crawl,
about 30 mins for just 100 pages.  My laptop also started to run
at 100% CPU the whole time. Is there a threshold for fetch list size,
above which fetch performance degrades? Or was it just my
laptop? I know the "-topN" option can control the fetch size. But topN=4000
seems too small because it will end up with thousands of segments.  Is there
a good rule of thumb for the topN setting?
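For reference, the cycle I'm running looks roughly like the sketch below (the "db" and "segments" paths are just my local layout, and the loop count and topN value are examples, not a recommendation):

```shell
# Rough sketch of my generate/fetch/updatedb loop.
# "db" is the web database dir, "segments" holds one dir per fetch cycle.
for i in 1 2 3; do
  bin/nutch generate db segments -topN 4000   # cap each fetch list at 4000 URLs
  segment=`ls -d segments/* | tail -1`        # pick up the segment just generated
  bin/nutch fetch $segment                    # this is the step that slows down
  bin/nutch updatedb db $segment
done
```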

A related question is how big a segment should be in order to keep the
number of segments small without hurting fetch performance too much. For
example, to crawl 1 million pages in one run (with many fetch cycles),
what would be a good size limit for each fetch list?
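To make the tradeoff concrete, here's the back-of-envelope arithmetic I'm doing (the topN values below are just examples, not suggestions):

```python
# Segments needed to cover a crawl of a given size, per fetch-list cap.
# Illustrative numbers only; real cycles also re-fetch and discover new URLs.
import math

def segments_needed(total_pages, top_n):
    """Number of fetch cycles (segments) to cover total_pages."""
    return math.ceil(total_pages / top_n)

for top_n in (4_000, 50_000, 200_000):
    print(top_n, segments_needed(1_000_000, top_n))
# topN=4000 means ~250 segments for 1M pages, which seems unmanageable.
```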

Thanks,
AJ



