tika-dev mailing list archives

From "Caleb Ott (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2802) Out of memory issues when extracting large files (pst)
Date Mon, 07 Jan 2019 21:09:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736351#comment-16736351 ]

Caleb Ott commented on TIKA-2802:

It looks like manually adding the xercesImpl dependency to my project resolved the issue!
// build.gradle dependencies
// https://mvnrepository.com/artifact/xerces/xercesImpl
compile group: 'xerces', name: 'xercesImpl', version: '2.12.0'
I didn't need to add the "-Djavax.xml.parsers..." command-line arguments, and I also switched
back to the Tika 1.20 release.

Before I added the xerces dependency, it was apparently using the Xerces version bundled with
Java, which seems pretty outdated. Should Tika add that dependency automatically, or are we
expected to add it ourselves if we want to use Xerces2?
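For anyone checking the same thing: the concrete factory class name reveals which implementation JAXP resolved at runtime. This is a minimal sketch; the exact class names depend on your JDK and classpath, and `org.apache.xerces.jaxp.SAXParserFactoryImpl` is the standard factory shipped in the xercesImpl artifact.

```java
import javax.xml.parsers.SAXParserFactory;

public class CheckSaxImpl {
    public static void main(String[] args) {
        // JAXP picks the SAX factory via the javax.xml.parsers.SAXParserFactory
        // system property, then services files on the classpath, then the JDK default.
        // Typical results:
        //   com.sun.org.apache.xerces.internal.* -> the JDK's bundled (older) Xerces fork
        //   org.apache.xerces.*                  -> the standalone xercesImpl artifact
        SAXParserFactory factory = SAXParserFactory.newInstance();
        System.out.println(factory.getClass().getName());
    }
}
```

Passing `-Djavax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl` on the command line forces the Xerces2 factory explicitly, provided xercesImpl is on the classpath.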

Note: I'll have to do some more in-depth testing to make sure the issue is fully resolved,
but it fixed the test scenario I was using.

> Out of memory issues when extracting large files (pst)
> ------------------------------------------------------
>                 Key: TIKA-2802
>                 URL: https://issues.apache.org/jira/browse/TIKA-2802
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20, 1.19.1
>         Environment: Reproduced on Windows 2012 R2 and Ubuntu 18.04.
> Java: jdk1.8.0_151
>            Reporter: Caleb Ott
>            Priority: Critical
>         Attachments: Selection_111.png, Selection_117.png
> I have an application that extracts text from multiple files on a file share. I've been
running into issues with the application running out of memory (~26g dedicated to the heap).
> I found in the heap dumps there is a "fDTDDecl" buffer which is creating very large char
arrays and never releasing that memory. In the picture you can see the heap dump with 4 SAXParsers
holding onto a large chunk of memory. The fourth one is expanded to show it is all being held
by the "fDTDDecl" field. This dump is from a scaled down execution (not a 26g heap).
> It looks like that DTD field should never be that large; I'm wondering if this is a bug
in Xerces instead. I can easily reproduce the issue by attempting to extract text from large
.pst files.

This message was sent by Atlassian JIRA
