tika-dev mailing list archives

From "Alan Burlison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case
Date Mon, 21 Sep 2015 20:49:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901371#comment-14901371 ]

Alan Burlison commented on TIKA-1737:

I'll redo the test and compare the outputs, but from memory the later PDFBox version was successfully
indexing slightly more files, despite all the exceptions. Unfortunately I can't share the
PDFs as they are confidential, but it's a set of around 5000 PDFs dating back as far as 1992,
so some of them are pretty certain to be non-compliant, which makes them a bit
of a torture test. And yes, I'd be happy to test an updated version.

Note that the exceptions I attached to the bug are just the ones that had a useful stack trace.
There were many more that just had a single line of error, presumably because they are being caught
within PDFBox itself.

I should point out that although the increase in exceptions is concerning, the real issue
is the horrendous memory leaks caused whenever a PDFBox exception is thrown, and those have definitely
got worse. It's not very helpful to detect more errors and throw more exceptions if that
just results in even more memory being leaked.

> PDFBox 1.8.10 is still a basket case
> ------------------------------------
>                 Key: TIKA-1737
>                 URL: https://issues.apache.org/jira/browse/TIKA-1737
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.10
>         Environment: Linux, Solaris
>            Reporter: Alan Burlison
>         Attachments: pdfbox.txt
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that bug the
> issues were fixed in 1.7. I've just updated to Tika 1.10 and rather than PDFBox being better
> it's actually far, far worse. With the same corpus, Tika 1.5 (PDFBox 1.8.6) has 13 exceptions
> thrown by PDFBox, Tika 1.10 (PDFBox 1.8.10) has *453* exceptions thrown by PDFBox. Not only
> that, but as far as I can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each time there's
> an error indexing a PDF file. It's so bad I'm going to switch to running pdftotext (part of
> Xpdf) as an external process. Note that many of the errors in PDFBox are clearly caused by
> programming errors, e.g. ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException
> and EOFException.
> I strongly recommend that Tika either revert to PDFBox 1.8.6 or find a replacement
> for PDFBox, as 1.8.10 just isn't fit for purpose.
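
For illustration only, the external-process approach mentioned in the issue text could look roughly like the Java sketch below. It is not the reporter's actual code. The class name PdfToTextRunner and its methods are hypothetical, and it assumes a pdftotext binary on the PATH that accepts "-enc UTF-8" and treats "-" as "write to stdout", as the Xpdf man page describes. Output is redirected to a temporary file so that the timeout still applies if the child process hangs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; not part of Tika, PDFBox or Xpdf.
class PdfToTextRunner {

    // Builds the pdftotext command line. The trailing "-" asks pdftotext
    // to write the extracted text to stdout (assumed Xpdf behaviour).
    static List<String> buildCommand(Path pdf) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), "-");
    }

    // Runs a command in a child process and returns whatever it wrote to stdout.
    // Throws IOException on timeout or non-zero exit code.
    static String run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path out = Files.createTempFile("extract-", ".txt");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectOutput(out.toFile())                   // stdout goes to the temp file
                    .redirectError(ProcessBuilder.Redirect.DISCARD) // stderr is dropped
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();                          // kill a hung extraction
                throw new IOException("Timed out: " + command);
            }
            if (process.exitValue() != 0) {
                throw new IOException("Exit code " + process.exitValue() + ": " + command);
            }
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);
        }
    }

    // Convenience wrapper: extract the text of one PDF.
    static String extractText(Path pdf, long timeoutSeconds)
            throws IOException, InterruptedException {
        return run(buildCommand(pdf), timeoutSeconds);
    }
}
```

A caller would invoke something like PdfToTextRunner.extractText(path, 60) per file and log any IOException. Because each PDF is handled in its own short-lived process, whatever memory a bad file consumes is released when that process exits, so the indexing JVM does not have to be torn down after a failure.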

This message was sent by Atlassian JIRA
