nutch-dev mailing list archives

From "Jim (JIRA)" <>
Subject [jira] Commented: (NUTCH-551) performance for generate is often really bad
Date Sat, 08 Sep 2007 02:14:30 GMT


Jim commented on NUTCH-551:

It is really maddening, but I cannot reproduce the bug with the jdb debugger attached.  Whenever
I run under jdb, generate finishes quickly (in minutes).  On the bright side, I am now
certain that there *is* a bug, because I can see from start to finish how long generate should
take (minutes as opposed to hours).

Also, I've been watching the map/reduce progress log and the output log at the same
time, and have verified that the chunkiness has something to do with the slowdown.  The logs
show the map progressing steadily until the logs start pausing every second
for a second.  After that, the percentage advances only very slowly.
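
A minimal sketch of how the pauses between map calls could be measured, rather than read off the logs by eye. `StallDetector`, `recordCall`, and the 500 ms threshold are hypothetical names and values chosen for illustration; none of this is part of Nutch or Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper (not part of Nutch): records when each call happens
 * and remembers every gap between consecutive calls that is at least
 * thresholdMillis long.
 */
class StallDetector {
    private final long thresholdMillis;
    private final List<Long> stalls = new ArrayList<Long>();
    private long lastCallMillis = -1;
    private long calls = 0;

    StallDetector(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    /** Call once per map invocation; pass System.currentTimeMillis() in real use. */
    synchronized void recordCall(long nowMillis) {
        if (lastCallMillis >= 0) {
            long gap = nowMillis - lastCallMillis;
            if (gap >= thresholdMillis) {
                stalls.add(gap);
                // Flush right away so the message cannot sit in a buffer
                System.err.println("stall of " + gap + " ms after " + calls + " calls");
                System.err.flush();
            }
        }
        lastCallMillis = nowMillis;
        calls++;
    }

    /** Copy of all recorded gaps, in the order they were seen. */
    synchronized List<Long> stalls() {
        return new ArrayList<Long>(stalls);
    }

    synchronized long callCount() {
        return calls;
    }
}
```

Calling `recordCall(System.currentTimeMillis())` as the first statement of the map function would, under these assumptions, print one line per pause and show how many URLs were handled between pauses.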

> performance for generate is often really bad
> --------------------------------------------
>                 Key: NUTCH-551
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 0.9.0
>         Environment: Ubuntu, Core Duo 2.4 GHz, 1 GB RAM, 750 GB hard drive.
>  The ethernet connection has a dedicated 1gb connection to the web, so certainly that
isn't a problem.
> I have tested on Nutch 0.9 and the newest daily build from 2007-08-28.
>            Reporter: Jim
>         Generate often takes many hours to finish (6+), where I would expect it to be
done in minutes.
>         This behavior has been observed for both small (~100) and large (~1000000) topN
values.  Other configuration values are
> -1
> false
>         I added debug code to Generator-> to see when map is called and
returns, and observed interesting behavior, described here:
>         1. Most of the time, when generate is run, URLs are processed in chunky batches,
usually about 40 at a time, followed by a 1 second delay.  I timed the delay, and it really
is a 1 second delay (i.e., 30 batches took 30 seconds).  When this happens it takes hours to
>         2. Sometimes (randomly, as far as I can tell) when I run Nutch, the URLs are processed
without delays.  It is an all-or-nothing event: either all URLs process quickly
without delay (in minutes), or, more likely, I get the chunky processing with many 1 second
delays and the program takes hours to end.  The one exception is...
>         3. When the processing runs quickly I've seen the main thread end (I have some
profiling going, so I know when a thread ends), and then more likely than not a second thread
begins where the first starts, chunky as usual.  Although I can sometimes get fast processing
in one thread, it is almost impossible for me to get it in all threads, and therefore overall
processing is very slow (hours).
>         4. I tried to put in more debug code to find the line where the delays occurred,
but the last line printed to the log at a delay seemed random, leading me to believe that
the log is not being flushed uniformly.  The timestamps in the log always indicate that the
delay is either right before or right after the first log item in the map function.
>         5. The profiler I used seemed to imply that about 100% of the time was spent
in java.lang.Thread.sleep.  I am not completely familiar with the profiler I used, so I am
not completely sure I interpreted this correctly.
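
One way to cross-check the profiler reading in item 5 without a debugger attached is to ask the JVM which threads are currently inside `Thread.sleep` and who called it. The sketch below uses only `Thread.getAllStackTraces()` from the standard library; the class name `SleepFinder` and its method are hypothetical, and how to invoke it from inside a running generate job is left open here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Nutch): lists threads currently in Thread.sleep. */
class SleepFinder {
    /**
     * Returns thread name -> first stack frame outside java.lang.Thread,
     * for every live thread whose top frames are a java.lang.Thread sleep method.
     */
    static Map<String, String> sleepingThreads() {
        Map<String, String> result = new LinkedHashMap<String, String>();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            StackTraceElement[] frames = e.getValue();
            boolean inSleep = false;
            int i = 0;
            // Walk the leading frames that belong to java.lang.Thread itself;
            // newer JDKs add internal helpers whose names also start with "sleep".
            while (i < frames.length
                    && frames[i].getClassName().equals("java.lang.Thread")) {
                if (frames[i].getMethodName().startsWith("sleep")) {
                    inSleep = true;
                }
                i++;
            }
            if (inSleep) {
                // The next frame is the code that asked for the sleep
                String caller = i < frames.length
                        ? frames[i].toString() : "(unknown caller)";
                result.put(e.getKey().getName(), caller);
            }
        }
        return result;
    }
}
```

Printing the returned map every few seconds from a background thread during a slow run would show whether the same caller keeps appearing, which is the frame to look at for the 1 second delay.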

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
