tika-dev mailing list archives

From "Alan Burlison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case
Date Tue, 22 Sep 2015 13:52:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902657#comment-14902657 ]

Alan Burlison commented on TIKA-1737:
-------------------------------------

bq. Could we have done something at the Tika level to cause this...I wonder?

I don't believe so. I think PDFBox is just not cleaning up properly after an exception. If
you want to 'fix' (?) this at the Tika level I think you'd have to do something similar to
what I'm doing and create a new PDFBox instance each time there's a PDFBox exception.
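
For illustration, the "throw the instance away after an exception" workaround could be sketched roughly as below. This is a minimal sketch, not the code from my app: ResettingExtractor, Extraction and rebuilds() are hypothetical names, and the parser is kept generic so the example stands alone without Tika on the classpath. It can only help if the leaked state is reachable solely through the discarded instance; anything held in static caches would survive.

```java
import java.util.function.Supplier;

/** Hypothetical helper, not part of Tika or PDFBox. */
final class ResettingExtractor<P, I, O> {

    /** One extraction call; may throw, as a parser would on a bad file. */
    interface Extraction<P, I, O> {
        O apply(P parser, I input) throws Exception;
    }

    private final Supplier<P> factory;          // builds a fresh parser instance
    private final Extraction<P, I, O> extraction;
    private P current;                          // instance reused while parsing succeeds
    private int rebuilds = 0;                   // how often we had to start over

    ResettingExtractor(Supplier<P> factory, Extraction<P, I, O> extraction) {
        this.factory = factory;
        this.extraction = extraction;
        this.current = factory.get();
    }

    O extract(I input) throws Exception {
        try {
            return extraction.apply(current, input);
        } catch (Exception e) {
            // Drop the instance that saw the failure so whatever it still
            // references becomes collectable, then let the caller see the error.
            current = factory.get();
            rebuilds++;
            throw e;
        }
    }

    int rebuilds() {
        return rebuilds;
    }
}
```

With Tika, I'd expect the factory to be something like Tika::new and the extraction something like (tika, file) -> tika.parseToString(file), but treat that wiring as an assumption rather than tested code.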

bq. Does the heap usage jump for every type of exception...that is, if I find any old PDF
that triggers an exception, do you think I'll see this with Tika 1.10?

Pretty much. I'm going to try to get a heap dump to work on but that means undoing all the
workaround code I've added, so it will take a bit for me to do that.

bq. Out of curiosity, are you using Tika in the same jvm as Lucene?

Yes, the app is the same as described in TIKA-1471. It's actually a Tomcat instance that contains
both the Lucene indexer and the search front end, with Tika used for text extraction for the
Lucene indexer.


> PDFBox 1.8.10 is still a basket case
> ------------------------------------
>
>                 Key: TIKA-1737
>                 URL: https://issues.apache.org/jira/browse/TIKA-1737
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.10
>         Environment: Linux, Solaris
>            Reporter: Alan Burlison
>         Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that bug the
> issues were fixed in 1.7. I've just updated to Tika 1.10 and rather than PDFBox being better
> it's actually far, far worse. With the same corpus, Tika 1.5 (PDFBox 1.8.6) has 13 exceptions
> thrown by PDFBox, Tika 1.10 (PDFBox 1.8.10) has *453* exceptions thrown by PDFBox. Not only
> that, but as far as I can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each time there's
> an error indexing a PDF file. It's so bad I'm going to switch to running pdftotext (part of
> Xpdf) as an external process. Note that many of the errors in PDFBox are clearly caused by
> programming errors, e.g. ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException
> and EOFException.
> I strongly recommend that Tika either reverts back to PDFBox 1.8.6 or finds a replacement
> for PDFBox as 1.8.10 just isn't fit for purpose.
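
For reference, running the extractor as an external process, as mentioned in the description, might look roughly like the sketch below. It is an assumption-laden illustration rather than the code in use: ExternalTextExtractor and its pdftotext() factory are hypothetical names, the timeout value is up to the caller, and it assumes pdftotext is on the PATH and accepts "-" as the output file to mean stdout. Output goes to a temporary file so a hung or chatty child can't block on a full pipe.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical helper that runs a text extractor in a separate process. */
final class ExternalTextExtractor {
    private final Function<Path, List<String>> commandFor; // builds argv for one file
    private final long timeoutSeconds;

    ExternalTextExtractor(Function<Path, List<String>> commandFor, long timeoutSeconds) {
        this.commandFor = commandFor;
        this.timeoutSeconds = timeoutSeconds;
    }

    /** Assumed invocation: pdftotext -enc UTF-8 <file> -  (text on stdout). */
    static ExternalTextExtractor pdftotext(long timeoutSeconds) {
        return new ExternalTextExtractor(
            pdf -> Arrays.asList("pdftotext", "-enc", "UTF-8", pdf.toString(), "-"),
            timeoutSeconds);
    }

    String extract(Path file) throws IOException, InterruptedException {
        Path out = Files.createTempFile("extract", ".txt");
        try {
            Process p = new ProcessBuilder(commandFor.apply(file))
                .redirectOutput(out.toFile())                    // child stdout -> temp file
                .redirectError(ProcessBuilder.Redirect.INHERIT)  // warnings go to our stderr
                .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();                             // kill a hung extractor
                throw new IOException("extractor timed out on " + file);
            }
            if (p.exitValue() != 0) {
                throw new IOException("extractor exited with status "
                    + p.exitValue() + " for " + file);
            }
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);
        }
    }
}
```

The point of the design is isolation: a crash or leak in the extractor ends with the child process, and the indexer JVM only sees an IOException for that one file.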



