tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
Date Thu, 14 Jan 2016 16:58:39 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098393#comment-15098393 ]

Tim Allison commented on TIKA-1830:
-----------------------------------

Finished the rerun...and the results look the same.

Question: On PDFBOX-3193, you've set the affected versions to 1.8.10 and 1.8.11.  Are you sure
that it affects 1.8.10?  I only discovered the problem because I was actually running
1.8.11.

In 1.8.10, 074531.pdf has ~30k words.  When I run 1.8.11 as a unit test within our PDFParser
wrapper, I also get ~30k words.  However, when I rerun our batch wrapper around 1.8.11 on
this file, I get the same exception that I got in the original run (see the reports
attached yesterday).

The exception is:

{noformat}
java.lang.NullPointerException
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1077)
at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1275)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1066)
at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:276)
at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:49)
at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:193)
at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:205)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:256)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:216)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:471)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:395)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:354)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
{noformat}

I get the same exception whether I run this in our batch code with 1 consumer or with 10
consumers, so it isn't a multithreading issue...hmmmm...will dig some more.

As a side note, I thought I wasn't comparing contents when there was an exception in one of
the files.  I need to fix my SQL to make sure that this is the case.
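
A rough sketch of that check, written in Java rather than SQL purely for illustration. The class, the field names and the sample rows (including other.pdf and its word count) are hypothetical and are not taken from the actual reports schema; the only idea carried over is that a file's contents get compared only if neither run hit an exception on it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExceptionAwareComparison {

    // Hypothetical per-file result of one extraction run; not the real reports schema.
    static final class ExtractResult {
        final String fileName;
        final int wordCount;
        final String exception; // null when extraction succeeded

        ExtractResult(String fileName, int wordCount, String exception) {
            this.fileName = fileName;
            this.wordCount = wordCount;
            this.exception = exception;
        }

        boolean failed() {
            return exception != null;
        }
    }

    // Returns the names of files that were extracted without an exception in
    // BOTH runs, in the order they appear in runA. Only these are safe to compare.
    static List<String> comparableFiles(List<ExtractResult> runA, List<ExtractResult> runB) {
        Map<String, ExtractResult> byName = new HashMap<>();
        for (ExtractResult r : runB) {
            byName.put(r.fileName, r);
        }
        List<String> out = new ArrayList<>();
        for (ExtractResult a : runA) {
            ExtractResult b = byName.get(a.fileName);
            if (b != null && !a.failed() && !b.failed()) {
                out.add(a.fileName);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample rows: one file fails in the second run, one succeeds in both.
        List<ExtractResult> run1810 = List.of(
                new ExtractResult("074531.pdf", 30000, null),
                new ExtractResult("other.pdf", 1200, null));
        List<ExtractResult> run1811 = List.of(
                new ExtractResult("074531.pdf", 0, "java.lang.NullPointerException"),
                new ExtractResult("other.pdf", 1200, null));

        // 074531.pdf is skipped because the second run threw on it.
        System.out.println(comparableFiles(run1810, run1811)); // prints [other.pdf]
    }
}
```

In SQL the equivalent would presumably be a join of the two runs on file name with a condition that the exception column is null on both sides, but that depends on how the tables are laid out.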


> Upgrade to PDFBox 1.8.11 when available
> ---------------------------------------
>
>                 Key: TIKA-1830
>                 URL: https://issues.apache.org/jira/browse/TIKA-1830
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>         Attachments: reports_pdfbox_1_8_11-rc1.zip
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
