tika-dev mailing list archives

From "Erik Hetzner (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-617) Series of exceptions from PDFBox
Date Mon, 23 Jun 2014 17:05:24 GMT

    [ https://issues.apache.org/jira/browse/TIKA-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040968#comment-14040968 ]

Erik Hetzner commented on TIKA-617:
-----------------------------------

The URL of the PDF is given in the comment above. Running it through Tika 1.5 produces different
errors and an incomplete XML file:

{noformat}
java -jar tika-app-1.5.jar http://www.arb.ca.gov/research/apr/past/01-340.pdf  > /dev/null
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:122)
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:142)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:418)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:112)
Caused by: java.io.IOException
	at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:138)
	at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:336)
	at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:248)
	at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:183)
	at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:107)
	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:106)
	... 7 more
Caused by: java.util.zip.DataFormatException: invalid distance too far back
	at java.util.zip.Inflater.inflateBytes(Native Method)
	at java.util.zip.Inflater.inflate(Inflater.java:259)
	at java.util.zip.Inflater.inflate(Inflater.java:280)
	at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
	at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
	... 18 more
{noformat}

> Series of exceptions from PDFBox
> --------------------------------
>
>                 Key: TIKA-617
>                 URL: https://issues.apache.org/jira/browse/TIKA-617
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.10
>            Reporter: Erik Hetzner
>
> Hi,
> I am getting the following exception from PDFBox. Thank you!
> (If I should file these upstream at PDFBox first, please let me know.)
> {noformat}
> $ java -jar tika-app-1.0-SNAPSHOT.jar http://www.arb.ca.gov/research/apr/past/01-340.pdf > /dev/null
> ERROR - Stop reading corrupt stream
> INFO - unsupported/disabled operation: f24.481
> INFO - unsupported/disabled operation: ree)n.
> WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
> java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
> 	at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
> 	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
> 	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
> 	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> 	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
> 	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
> 	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
> INFO - unsupported/disabled operation: i-
> INFO - unsupported/disabled operation: R4%
> INFO - unsupported/disabled operation: )
> INFO - unsupported/disabled operation: Re.8
> INFO - unsupported/disabled operation: e.
> INFO - unsupported/disabled operation: FE)-
> WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
> java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
> 	at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
> 	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
> 	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
> 	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> 	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
> 	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
> 	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
> INFO - unsupported/disabled operation: R3%
> INFO - unsupported/disabled operation: T
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5809fdee
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> 	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
> 	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
> 	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
> Caused by: java.lang.RuntimeException: java.io.IOException: Error: Expected operator 'ID' actual='I8'
> 	at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:178)
> 	at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:187)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:266)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
> 	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
> 	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
> 	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	... 5 more
> Caused by: java.io.IOException: Error: Expected operator 'ID' actual='I8'
> 	at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:382)
> 	at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
> 	at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:175)
> 	... 15 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
