tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-617) Series of exceptions from PDFBox
Date Mon, 23 Jun 2014 17:41:26 GMT

    [ https://issues.apache.org/jira/browse/TIKA-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041025#comment-14041025 ]

Tim Allison commented on TIKA-617:
----------------------------------

Confirmed still a problem with both the classic (sequential) parser and the newer
NonSequentialParser in Tika trunk with PDFBox 1.8.6. Please open an issue in PDFBox
if you haven't done so already. Thank you!

Found the same issue here (although Adobe couldn't read this one either without serious problems):
http://digitalcorpora.org/corp/nps/files/govdocs1/898/898385.pdf

> Series of exceptions from PDFBox
> --------------------------------
>
>                 Key: TIKA-617
>                 URL: https://issues.apache.org/jira/browse/TIKA-617
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.10
>            Reporter: Erik Hetzner
>
> Hi,
> I am getting the following exception from PDFBox. Thank you!
> (If I should file these upstream at PDFBox first, please let me know.)
> {noformat}
> $ java -jar tika-app-1.0-SNAPSHOT.jar http://www.arb.ca.gov/research/apr/past/01-340.pdf > /dev/null
> ERROR - Stop reading corrupt stream
> INFO - unsupported/disabled operation: f24.481
> INFO - unsupported/disabled operation: ree)n.
> WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
> java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
> 	at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
> 	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
> 	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
> 	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> 	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
> 	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
> 	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
> INFO - unsupported/disabled operation: i-
> INFO - unsupported/disabled operation: R4%
> INFO - unsupported/disabled operation: )
> INFO - unsupported/disabled operation: Re.8
> INFO - unsupported/disabled operation: e.
> INFO - unsupported/disabled operation: FE)-
> WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
> java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
> 	at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
> 	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
> 	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
> 	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> 	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
> 	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
> 	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
> INFO - unsupported/disabled operation: R3%
> INFO - unsupported/disabled operation: T
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5809fdee
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> 	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
> 	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
> 	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
> Caused by: java.lang.RuntimeException: java.io.IOException: Error: Expected operator 'ID' actual='I8'
> 	at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:178)
> 	at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:187)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:266)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
> 	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
> 	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
> 	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	... 5 more
> Caused by: java.io.IOException: Error: Expected operator 'ID' actual='I8'
> 	at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:382)
> 	at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
> 	at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:175)
> 	... 15 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
