nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Closed: (NUTCH-85) pdf parser caused fetcher hangs.
Date Tue, 20 Sep 2005 07:10:29 GMT
     [ http://issues.apache.org/jira/browse/NUTCH-85?page=all ]
     
Andrzej Bialecki  closed NUTCH-85:
----------------------------------

    Resolution: Fixed

The parser has been updated to use PDFBox-0.7.2, which should solve this issue. Please re-open
if that's not the case.

> pdf parser caused fetcher hangs.
> --------------------------------
>
>          Key: NUTCH-85
>          URL: http://issues.apache.org/jira/browse/NUTCH-85
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.7, 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>
> We have noticed fetcher hangs caused by PDFBox.
> A thread that is parsing a PDF may hang and never become available again.
> Once this has happened to as many threads as are active, the complete fetch process hangs.
>  
> Full thread dump Java HotSpot(TM) Client VM (1.4.2_08-b03 mixed mode):
> "fetcher160" prio=1 tid=0x083c9720 nid=0x16de runnable [b1669000..b166a238]
> 	at org.pdfbox.cmaptypes.CMap.addMapping(CMap.java:119)
> 	at org.pdfbox.cmapparser.CMapParser.parse(CMapParser.java:183)
> 	at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:532)
> 	at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:358)
> 	at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:261)
> 	at org.pdfbox.util.operator.ShowText.process(ShowText.java:63)
> 	at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:405)
> 	at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:385)
> 	at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:168)
> 	at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:232)
> 	at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:205)
> 	at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:180)
> 	at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:108)
> 	at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:123)
> 	at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:239)
> 	at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
> "fetcher82" prio=1 tid=0xb4637d78 nid=0x59aa runnable [b4379000..b437a238]
> 	at java.nio.charset.CoderResult$1.create(CoderResult.java:207)
> 	at java.nio.charset.CoderResult$Cache.get(CoderResult.java:196)
> 	- locked <0xb94fa908> (a java.nio.charset.CoderResult$1)
> 	at java.nio.charset.CoderResult$Cache.access$200(CoderResult.java:178)
> 	at java.nio.charset.CoderResult.malformedForLength(CoderResult.java:217)
> 	at sun.nio.cs.UnicodeDecoder.decodeLoop(UnicodeDecoder.java:71)
> 	at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:538)
> 	at java.lang.StringCoding$CharsetSD.decode(StringCoding.java:192)
> 	at java.lang.StringCoding.decode(StringCoding.java:230)
> 	at java.lang.String.<init>(String.java:320)
> 	at java.lang.String.<init>(String.java:346)
> 	at org.pdfbox.cmapparser.CMapParser.createStringFromBytes(CMapParser.java:230)
> 	at org.pdfbox.cmapparser.CMapParser.parse(CMapParser.java:182)
> 	at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:532)
> 	at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:358)
> 	at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:261)
> 	at org.pdfbox.util.operator.ShowText.process(ShowText.java:63)
> 	at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:405)
> 	at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:385)
> 	at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:168)
> 	at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:232)
> 	at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:205)
> 	at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:180)
> 	at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:108)
> 	at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:123)
> 	at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:239)
> 	at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
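
The report above describes fetcher threads that enter PDF parsing and never return. Independent of the PDFBox upgrade, one common way to keep a runaway parse from taking a fetcher thread with it is to run the parse under a time limit. The following is only a minimal sketch of that idea using java.util.concurrent (Java 5 or later), not the fix that was applied to Nutch. The class TimedParse, the method parseWithTimeout, and the fallback value are hypothetical names introduced for illustration; the Callable stands in for the PdfParser.getParse call seen in the thread dump.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Nutch or PDFBox.
class TimedParse {

    // Worker threads are daemons, so a parse that never returns
    // cannot keep the JVM alive after the fetch has finished.
    private static final ExecutorService POOL =
        Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "timed-parse");
            t.setDaemon(true);
            return t;
        });

    /**
     * Runs parseTask on a worker thread and waits at most timeoutMillis.
     * Returns the task's result, or fallback if the task times out,
     * throws, or the caller is interrupted while waiting.
     */
    static String parseWithTimeout(Callable<String> parseTask,
                                   long timeoutMillis,
                                   String fallback) {
        Future<String> future = POOL.submit(parseTask);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker; code that never
            // checks the interrupt flag will keep running in the background.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The parse itself threw an exception.
            return fallback;
        } catch (InterruptedException e) {
            // Restore the caller's interrupt status before giving up.
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) {
        // A well-behaved "parse" returns normally.
        System.out.println(parseWithTimeout(() -> "parsed text", 1000, "(skipped)"));
        // A stuck "parse" is abandoned after 100 ms.
        System.out.println(parseWithTimeout(() -> {
            Thread.sleep(5000);
            return "never returned";
        }, 100, "(skipped)"));
    }
}
```

With this arrangement the calling thread gets control back after the time limit and can move on to the next URL. The caveat is that a worker stuck in a tight loop, as the runnable threads in the dump appear to be, is not actually stopped and still consumes CPU, so upgrading the parser remains the real fix.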

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

