nutch-dev mailing list archives

From AJ Chen <cano...@gmail.com>
Subject Re: fetch performance
Date Sat, 10 Sep 2005 03:35:57 GMT
Hi Andrzej,
Thanks for the suggestion. I'm using the PDF plugin that
comes with Nutch from svn. Where can I get the unreleased
PDFBox 0.7.2 that works for you?
-AJ



On 9/9/05, Andrzej Bialecki <ab@getopt.org> wrote:
> 
> AJ wrote:
> > I tried to run 10 cycles of fetch/updatedb. In the 3rd cycle, the fetch
> > list had 8810 urls. Fetch ran pretty fast on my laptop until 4000
> > pages were fetched. After 4000 pages, it suddenly slowed to a crawl,
> > about 30 mins for just 100 pages. My laptop also started to run
> > at 100% CPU all the time. Is there a threshold for fetch list size,
> > above which fetch performance degrades? Or was it because of my
> > laptop? I know the "-topN" option can control the fetch size. But topN=4000
> > seems too small because it will end up with thousands of segments. Is there
> > a good rule of thumb for the topN setting?
> >
> > A related question is how big a segment should be in order to keep the
> > number of segments small without hurting fetch performance too much. For
> > example, to crawl 1 million pages in one run (with many fetch cycles),
> > what would be a good limit for each fetch list?
> 
> There are no artificial limits like that - I'm routinely fetching
> segments of 1 million pages. Most likely what happened to you is that:
> 
> * you are using a Nutch version with PDFBox 0.7.1 or below
> 
> * you fetched a rare kind of PDF, which puts PDFBox in a tight loop
> 
> * the thread that got stuck is consuming 99% of your CPU. :-)
> 
> Solution: upgrade PDFBox to the as yet unreleased 0.7.2.
> 
> 
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
> 
>
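
For the topN question quoted above, the trade-off between fetch list size and segment count comes down to simple arithmetic. The following is only a rough sketch, assuming that each fetch cycle produces exactly one segment and that every fetch list is filled up to the topN cap; the function name `segments_needed` is made up for illustration and is not part of Nutch.

```python
def segments_needed(total_pages: int, top_n: int) -> int:
    """Estimate how many segments a crawl produces.

    Assumption (not taken from Nutch itself): one segment per fetch
    cycle, and every fetch list holds exactly top_n URLs except
    possibly the last one.
    """
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if top_n <= 0:
        raise ValueError("top_n must be positive")
    # Integer ceiling division: a partially filled last fetch list
    # still costs one whole segment.
    return (total_pages + top_n - 1) // top_n


if __name__ == "__main__":
    # Compare a few fetch list caps for a 1 million page crawl.
    for top_n in (4_000, 100_000, 1_000_000):
        print(top_n, segments_needed(1_000_000, top_n))
```

Under these assumptions, a 1 million page crawl with topN=4000 needs 250 segments, while a cap of 100,000 needs 10. Real crawls will differ, since a fetch list may come up short when the web db has fewer unfetched URLs than the cap.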
