nutch-dev mailing list archives

From "Hudson (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1182) fetcher to log hung threads
Date Sun, 04 May 2014 22:11:19 GMT


Hudson commented on NUTCH-1182:

SUCCESS: Integrated in Nutch-nutchgora #1010 (See [])
NUTCH-1182 fetcher to log hung threads (snagel:
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/java/org/apache/nutch/fetcher/

> fetcher to log hung threads
> ---------------------------
>                 Key: NUTCH-1182
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.3, 1.4
>         Environment: Linux, local job runner
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Minor
>             Fix For: 2.3, 1.9
>         Attachments: NUTCH-1182-2x.patch, NUTCH-1182-trunk-v1.patch, NUTCH-1182-v2.patch
> While crawling a slow server that hosts a couple of very large PDF documents (30 MB),
> the fetcher stops after some time, having already fetched a bulk of documents successfully,
> with the message: ??Aborting with 10 hung threads.??
> From then on every cycle ends with hung threads and almost no documents are fetched
> successfully. In addition, strange Hadoop errors are logged:
> {noformat}
>    fetch of http://.../xyz.pdf failed with: java.lang.NullPointerException
>     at java.lang.System.arraycopy(Native Method)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(
>     ...
> {noformat}
> or
> {noformat}
>    Exception in thread "QueueFeeder" java.lang.NullPointerException
>          at org.apache.hadoop.fs.BufferedFSInputStream.getPos(
>          at org.apache.hadoop.fs.FSDataInputStream.getPos(
>          at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(
> {noformat}
> I've run the debugger and found:
> # After the "hung threads" are reported the fetcher stops, but the threads are still alive and continue fetching their documents. As a consequence:
> #* the already limited bandwidth of the network/server is reduced even further;
> #* once the document is fetched, the thread tries to write the content via {{output.collect()}}, which must fail because the fetcher map job has already finished and the associated temporary mapred directory has been deleted. The error message may get mixed with the progress output of the next fetch cycle, causing additional confusion.
> # Documents/URLs causing a hung thread are neither reported nor stored. That is, it's hard to track them down, and they will cause a hung thread again and again.
> The problem is reproducible when fetching bigger documents and setting {{mapred.task.timeout}} to a low value (this will definitely cause hung threads).
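
For illustration only, the kind of reporting requested in the issue title could look roughly like the sketch below. It is not taken from the attached patches: the class {{HungThreadReport}}, the {{FetchWorker}} stand-in, its {{currentUrl}} field and the example URL are all hypothetical names introduced here. The sketch assumes each fetcher thread records the URL it is working on, so that a monitor can list every thread that is still alive together with that URL and its current stack trace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class HungThreadReport {

    /** Hypothetical stand-in for a fetcher thread; not Nutch's actual class. */
    static class FetchWorker extends Thread {
        // URL currently being fetched, or null when idle. Volatile so that
        // the monitoring thread always sees the latest value.
        private volatile String currentUrl;
        private final Runnable task;

        FetchWorker(String name, Runnable task) {
            super(name);
            this.task = task;
            setDaemon(true); // a stuck fetch must not keep the JVM alive
        }

        void setCurrentUrl(String url) { this.currentUrl = url; }
        String getCurrentUrl() { return currentUrl; }

        @Override
        public void run() { task.run(); }
    }

    /** Builds one report entry per worker that is still alive. */
    static List<String> describeHungThreads(List<FetchWorker> workers) {
        List<String> report = new ArrayList<>();
        for (FetchWorker w : workers) {
            if (!w.isAlive()) {
                continue; // finished threads are not hung
            }
            StringBuilder sb = new StringBuilder();
            sb.append("Thread ").append(w.getName())
              .append(" hung while processing ").append(w.getCurrentUrl());
            // The stack trace shows where the thread is currently blocked.
            for (StackTraceElement frame : w.getStackTrace()) {
                sb.append("\n    at ").append(frame);
            }
            report.add(sb.toString());
        }
        return report;
    }

    public static void main(String[] args) {
        // Simulate a fetch that never completes by blocking on a latch.
        CountDownLatch block = new CountDownLatch(1);
        FetchWorker slow = new FetchWorker("FetcherThread-0", () -> {
            try {
                block.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        slow.setCurrentUrl("http://example.com/large.pdf"); // hypothetical URL
        slow.start();

        List<FetchWorker> workers = new ArrayList<>();
        workers.add(slow);
        for (String entry : describeHungThreads(workers)) {
            System.err.println(entry);
        }
        block.countDown(); // release the simulated fetch
    }
}
```

Logging the URL next to the thread name addresses the second finding above: the documents that cause hung threads become traceable. The sketch does not stop the threads, so the first finding (threads that keep running after the job has ended) would still need separate handling.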

This message was sent by Atlassian JIRA