tika-dev mailing list archives

From "James Hardwick (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-1462) PDFont consumes all heap space
Date Wed, 29 Oct 2014 21:37:33 GMT
James Hardwick created TIKA-1462:

             Summary: PDFont consumes all heap space
                 Key: TIKA-1462
                 URL: https://issues.apache.org/jira/browse/TIKA-1462
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.6
            Reporter: James Hardwick
            Priority: Critical

See https://issues.apache.org/jira/browse/PDFBOX-2200 for more details.

In short, PDFont does not release its resources and eventually accumulates enough objects to
consume all available memory. We are encountering this in production environments, where it causes
our Solr server to crash when ingesting large numbers of PDF documents.
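
To illustrate the general failure mode, here is a minimal sketch of a class-level cache that is only ever added to. This is not PDFBox code: FontCacheSketch, load, size, and clear are hypothetical names, and the 1 KB payload is an arbitrary stand-in for per-font data. It assumes each ingested document contributes at least one cache key that has not been seen before.

```java
// Hypothetical illustration of an unbounded static cache; this is not PDFBox code.
final class FontCacheSketch {
    // Class-level (static) map: entries outlive every document that added them.
    private static final java.util.Map<String, byte[]> CACHE = new java.util.HashMap<>();

    private FontCacheSketch() {}

    // Returns the cached data for a font key, allocating it on first use.
    static byte[] load(String fontKey) {
        return CACHE.computeIfAbsent(fontKey, k -> new byte[1024]);
    }

    // Number of entries currently held by the cache.
    static int size() {
        return CACHE.size();
    }

    // Drops every entry so the garbage collector can reclaim the memory.
    static void clear() {
        CACHE.clear();
    }

    public static void main(String[] args) {
        for (int doc = 0; doc < 1000; doc++) {
            // Assumption: every document brings a key not seen before.
            load("doc-" + doc + "-embedded-font");
        }
        System.out.println("entries after 1000 documents: " + size());
        clear();
        System.out.println("entries after clear: " + size());
    }
}
```

Nothing in this sketch removes entries unless clear() is called, so a long-running ingest process grows the map until the heap is exhausted. Clearing between documents or batches trades repeated loading work for bounded memory.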

The fix is reportedly included in the 2.0.0 release of PDFBox, but that version has been pending
for so long that I suggest implementing the workaround proposed in the PDFBox issue.

