tika-dev mailing list archives

From Doug Carter <dcar...@mercycorps.org>
Subject PDF parser exception
Date Tue, 12 Jan 2010 19:37:52 GMT

Hi all,

I'm new to Tika and to this mailing list, so I hope this is the right
place to ask this question.

I've just downloaded, built, and installed Tika 0.5. I've been able to
translate Microsoft Office documents without any problems. However, when
I try to translate a PDF file, I get a parser exception.

The command line I'm running is:

  % java -jar tika-app/target/tika-app-0.5.jar foo.pdf

The resulting exception output is:

Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11e1e67
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:126)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:175)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:62)
Caused by: org.apache.pdfbox.exceptions.WrappedIOException
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:53)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
        ... 3 more
Caused by: java.util.NoSuchElementException
        at java.util.AbstractList$Itr.next(AbstractList.java:350)
        at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
        at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
        ... 7 more
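
Reading the trace bottom-up, the innermost "Caused by" entry (the NoSuchElementException thrown inside PDFXrefStreamParser) looks like the actual failure, with the outer two exceptions only wrapping it. As a minimal sketch of how that chain can be unwound in code, the snippet below walks getCause() links to the innermost exception. The class name RootCause and the stand-in exceptions are hypothetical and only mirror the shape of the trace; they are not the real Tika or PDFBox classes.

```java
import java.io.IOException;
import java.util.NoSuchElementException;

class RootCause {
    /** Follows getCause() links until the innermost exception is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in exceptions that mirror the three levels of the trace;
        // these are illustrative only, not the real Tika/PDFBox types.
        Throwable inner = new NoSuchElementException();
        Throwable middle = new IOException("wrapped", inner);
        Throwable outer = new Exception("TIKA-198: Illegal IOException", middle);

        System.out.println(rootCause(outer).getClass().getName());
    }
}
```

Applied to a caught TikaException, the same loop would presumably surface the NoSuchElementException directly, which may be the more useful thing to search for or report.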


Can someone help point me to a way to solve this problem? I'm familiar
with Java but not the PDF format or how Tika parses a document.

Please let me know if there is a better forum to ask this question, or
if I need to provide more information.


