tika-dev mailing list archives

From: Ken Krugler <kkrugler_li...@transpac.com>
Subject: Re: PDF parser exception
Date: Wed, 13 Jan 2010 00:12:29 GMT

Hi Doug,

> The problem *seems* to be limited to those documents created by
> Acrobat 9. (PDF version 1.5 versus version 1.4) That is, 1.4 documents
> translate OK, where 1.5 documents get this error.
>
> If it matters, the bad file can be opened OK with Acrobat Reader.
>
> Any ideas on how to debug this? Or is Acrobat 9 (version 1.5) a
> known problem for Tika?

Acrobat 9 was a known problem for PDFBox, which is the PDF parser that  
Tika wraps.

But according to http://issues.apache.org/jira/browse/PDFBOX-361, this  
was fixed in 0.8-incubating, which is the release that Tika is using.

However, I see http://issues.apache.org/jira/browse/PDFBOX-536, which
seems to be the same as your issue. That's fixed in PDFBox's trunk,
but not in the 0.8-incubating release.
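
When comparing a failure against an issue report, the useful part of a
nested trace is usually the innermost "Caused by" (here,
java.util.NoSuchElementException thrown from PDFXrefStreamParser). A minimal
sketch of pulling that out programmatically, assuming you catch the
exception yourself; RootCause is a hypothetical helper, not part of Tika or
PDFBox:

```java
import java.io.IOException;
import java.util.NoSuchElementException;

// Hypothetical helper for illustration; not a Tika or PDFBox class.
final class RootCause {
    private RootCause() {}

    /** Follows getCause() links until the innermost throwable is reached. */
    static Throwable of(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for the three-level nesting seen in the reported trace.
        Exception nested = new RuntimeException("outer",
                new IOException("wrapped", new NoSuchElementException()));
        // Prints the class name of the innermost cause.
        System.out.println(RootCause.of(nested).getClass().getName());
    }
}
```

In real use you would pass the exception caught around your own parse call
to RootCause.of(...) and log the result alongside the file name.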

I've also had to check out and build PDFBox from trunk to get a recent
(post-0.8) fix, so you could do the same.
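
Before retesting against a trunk build, it may help to sort your files by
the PDF version they declare. A PDF file starts with a "%PDF-x.y" header,
and cross-reference streams (the structure PDFXrefStreamParser handles)
were introduced in PDF 1.5. The sketch below only reads that header; the
class name PdfHeaderVersion is made up for illustration, and the header can
be overridden by a /Version entry in the document catalog, so treat the
result as a hint rather than a definitive answer:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not a Tika or PDFBox class.
final class PdfHeaderVersion {
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d\\.\\d)");

    private PdfHeaderVersion() {}

    /** Returns the version from a "%PDF-x.y" header in the bytes, or null if none is found. */
    static String fromBytes(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so binary data cannot break decoding.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] buf = new byte[1024];
            int n = 0;
            try (InputStream in = Files.newInputStream(Paths.get(name))) {
                int r;
                // Read up to the first 1024 bytes; the header should be near the start.
                while (n < buf.length && (r = in.read(buf, n, buf.length - n)) > 0) {
                    n += r;
                }
            }
            System.out.println(name + ": " + fromBytes(Arrays.copyOf(buf, n)));
        }
    }
}
```

Run it with one or more file names as arguments, for example
"java PdfHeaderVersion foo.pdf", to get one line per file.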

-- Ken

> On Tue, Jan 12, 2010 at 02:18:02PM -0800, Ken Krugler wrote:
>> Hi Doug,
>>
>> On Jan 12, 2010, at 11:37am, Doug Carter wrote:
>>
>>>
>>> Hi all,
>>>
>>> I'm new to Tika and to this mailing list, so I hope this is the right
>>> place to ask this question.
>>>
>>> I've just downloaded, built and installed Tika 0.5. I've been able to
>>> translate Microsoft Office documents without any problems. However, when
>>> I try to translate a PDF file, I get a parser exception.
>>
>> Is this the case with any and all PDF files?
>>
>> Based on the stack trace below, it sure looks like a busted file, but
>> I've mostly been working with the HTML parser.
>>
>> -- Ken
>>
>>>
>>> The command line I'm running is:
>>>
>>> % java -jar tika-app/target/tika-app-0.5.jar foo.pdf
>>>
>>> The resulting exception output is:
>>>
>>> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11e1e67
>>>      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:126)
>>>      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
>>>      at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:175)
>>>      at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:62)
>>> Caused by: org.apache.pdfbox.exceptions.WrappedIOException
>>>      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
>>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
>>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
>>>      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:53)
>>>      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
>>>      ... 3 more
>>> Caused by: java.util.NoSuchElementException
>>>      at java.util.AbstractList$Itr.next(AbstractList.java:350)
>>>      at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
>>>      at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
>>>      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
>>>      ... 7 more
>>>
>>> ---
>>>
>>> Can someone help point me to a way to solve this problem? I'm familiar
>>> with Java but not the PDF format or how Tika parses a document.
>>>
>>> Please let me know if there is a better forum to ask this question, or
>>> if I need to provide more information.
>>>
>>>
>>> TIA,
>>>
>>> Doug
>>
>> --------------------------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> e l a s t i c   w e b   m i n i n g
