nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: fetch performance
Date Fri, 09 Sep 2005 19:20:45 GMT
AJ wrote:
> I tried to run 10 cycles of fetch/updatedb.  In the 3rd cycle, the fetch 
> list had 8810 urls.  Fetch ran pretty fast on my laptop before 4000 
> pages were fetched. After 4000 pages, it suddenly switched to very slow 
> speed, about 30 mins for just 100 pages.  My laptop also started to run 
> at 100% CPU all the time. Is there a threshold for fetch list size, 
> above which fetch performance will be degraded? Or was it because of my 
> laptop? I know the "-topN" option can control the fetch size. But topN=4000 
> seems too small because it will end up with thousands of segments.  Is there 
> a good rule of thumb for the topN setting?
> A related question is how big a segment should be in order to keep the 
> number of segments small without hitting fetch performance too much. For 
> example, to crawl 1 million pages in one run (with many fetch cycles), 
> what would be a good limit for each fetch list?

There are no artificial limits like that - I routinely fetch 
segments of 1 million pages. Most likely what happened to you is that:

* you are using a Nutch version with PDFBox 0.7.1 or below

* you fetched a rare kind of PDF, which puts PDFBox in a tight loop

* the thread that got stuck is consuming 99% of your CPU. :-)

Solution: upgrade PDFBox to the as yet unreleased 0.7.2.
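
If upgrading is not possible right away, one general way to keep a single bad document from stalling everything is to bound the time a parse is allowed to take. The following is only a sketch under assumptions: ParseWithTimeout and runWithTimeout are hypothetical names, not part of Nutch or PDFBox, and the task passed in stands in for whatever call parses the PDF. Note the limitation: cancelling only interrupts the worker, so code spinning in a tight loop that never checks for interruption will keep burning CPU. The caller gets control back, but the real cure is still the fixed library.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not a Nutch or PDFBox class): runs a task with a time limit.
public class ParseWithTimeout {

    // Daemon threads, so a worker stuck in a tight loop cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "parse-worker");
        t.setDaemon(true);
        return t;
    });

    // Returns the task's result, or an empty Optional if it did not finish in time.
    public static <T> Optional<T> runWithTimeout(Callable<T> task, long timeoutMillis)
            throws ExecutionException, InterruptedException {
        Future<T> future = POOL.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Requests interruption; a loop that ignores interrupts keeps running.
            future.cancel(true);
            return Optional.empty();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for a parse call: one returns quickly, one takes far too long.
        Optional<String> fast = runWithTimeout(() -> "parsed", 1000);
        Optional<String> slow = runWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never";
        }, 200);
        System.out.println("fast task: " + fast.orElse("timed out"));
        System.out.println("slow task: " + slow.orElse("timed out"));
    }
}
```

The caller can then log the URL of a document that timed out and move on to the next one, which also makes it easier to find the offending file later.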

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
Contact: info at sigram dot com
