nutch-dev mailing list archives

From "Doğacan Güney" <>
Subject Re: bug with generate performance
Date Fri, 07 Sep 2007 07:37:51 GMT

On 8/31/07, misc <> wrote:
> Hello-
>     I am almost certain I have found a nasty bug with nutch generate.
>     Problem: Nutch generate can take many hours, even a day to complete (on a crawldb
that has less than 2 million urls).
>     I added debug code to Generator-> to see when map is called and returns,
and observed interesting behavior, described here:
>     1. Most of the time, when generate is run, urls are processed in chunky batches, usually
about 40 at a time, followed by a 1 second delay.  I timed the delay, and it really is a 1
second delay (i.e., 30 batches took 30 seconds).  When this happens it takes hours to complete.
>     2. Sometimes (randomly as far as I can tell) when I run nutch, the urls are processed
without delays.  It is an all or nothing event, either I run and all urls process quickly
without delay (in minutes), or more likely I get the chunky processing with many 1 second
delays and the program takes hours to end.  The one exception is....
>     3. When the processing runs quickly I've seen the main thread end (I have some profiling
going, so I know when a thread ends), and then more likely than not a second thread begins
where the first starts, chunky like usual.  Although I sometimes can get fast processing in
one thread, it is almost impossible for me to get it in all threads and therefore general
processing is very slow (hours).
>     4. I tried to put in more debug code to find the line where the delays occurred, but
the last line printed to the log at a delay seemed random, leading me to believe that the
log is not being flushed uniformly.
>     5. The profiler I used seemed to imply that about 100% of the time was spent in java.lang.Thread.sleep.
 I am not completely familiar with the profiler I used so I am not completely sure I interpreted
this correctly.
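
The profiler reading in item 5 could be cross-checked without a profiler by asking the JVM which threads are currently inside Thread.sleep and which frame called it. The following is only a minimal sketch using the standard Thread.getAllStackTraces() call; the class name SleepingThreadFinder and the method findSleepingCallers are hypothetical, and the code has to run inside the same JVM as the slow task (for example from a debug hook), which is an assumption about how it would be wired in.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Nutch or Hadoop.
public class SleepingThreadFinder {

    /**
     * Returns a map from thread name to the first stack frame outside
     * java.lang.Thread, for every live thread currently inside Thread.sleep.
     * Threads that share a name overwrite each other in this simple sketch.
     */
    public static Map<String, String> findSleepingCallers() {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            StackTraceElement[] stack = entry.getValue();
            int i = 0;
            boolean inSleep = false;
            // Skip the leading java.lang.Thread frames; a sleeping thread has
            // one or more frames there whose method name starts with "sleep".
            while (i < stack.length
                    && stack[i].getClassName().equals("java.lang.Thread")) {
                if (stack[i].getMethodName().startsWith("sleep")) {
                    inSleep = true;
                }
                i++;
            }
            if (inSleep) {
                String caller = i < stack.length
                        ? stack[i].toString()
                        : "(no caller frame)";
                result.put(entry.getKey().getName(), caller);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Print each sleeping thread together with the code that put it to sleep.
        findSleepingCallers().forEach(
                (name, caller) -> System.out.println(name + " sleeping in " + caller));
    }
}
```

If the same caller frame shows up in every sample taken during a slow run, that frame is a likely source of the 1 second delays. From outside the process, the JDK's jstack tool gives the same stack information for a given pid.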
>     I will keep debugging here, but perhaps someone here has some insight into what might
be happening?
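
Because log output may not be flushed evenly (item 4), one option for locating the pauses from item 1 is to record timestamps in memory and report only the gaps that exceed a threshold. This is a sketch under stated assumptions: GapRecorder is a hypothetical class, and calling mark() at the start of each map call is an assumption about where the instrumentation would go.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch or Hadoop.
public class GapRecorder {
    private final long thresholdNanos;
    private boolean hasLast = false;
    private long lastNanos;
    private long calls = 0;
    private final List<String> gaps = new ArrayList<>();

    public GapRecorder(long thresholdMillis) {
        this.thresholdNanos = thresholdMillis * 1_000_000L;
    }

    /** Records a call at the current time. */
    public synchronized void mark() {
        mark(System.nanoTime());
    }

    /** Records a call at an explicit timestamp in nanoseconds. */
    public synchronized void mark(long nowNanos) {
        if (hasLast && nowNanos - lastNanos >= thresholdNanos) {
            long gapMillis = (nowNanos - lastNanos) / 1_000_000L;
            gaps.add("gap of " + gapMillis + " ms before call " + calls);
        }
        lastNanos = nowNanos;
        hasLast = true;
        calls++;
    }

    /** Returns a copy of the gaps recorded so far. */
    public synchronized List<String> gaps() {
        return new ArrayList<>(gaps);
    }
}
```

Printing gaps() once at the end of the task shows how many calls ran between pauses, which would confirm or refute the "about 40 urls, then 1 second" pattern independently of log flushing.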

Others have also reported a problem with generate performance. It
seems we have a problem here, but I cannot reproduce this behaviour, so
I am not sure what causes it. Can you open a JIRA issue and enter your
comments there? Also, knowing how you are running generate would be very
helpful (what the -topN argument is, etc.)

>                         thanks
>                             -J

Doğacan Güney