tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2802) Out of memory issues when extracting large files (pst)
Date Tue, 08 Jan 2019 16:35:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737283#comment-16737283 ]

Tim Allison commented on TIKA-2802:
-----------------------------------

{noformat}
+- org.apache.ctakes:ctakes-core:jar:4.0.0:provided
[INFO] |  +- org.apache.ctakes:ctakes-core-res:jar:4.0.0:provided
[INFO] |  +- xerces:xercesImpl:jar:2.11.0:provided
{noformat}

That would explain why I'm seeing xerces in my dev environment, and you're not seeing it when
you pull it in.
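
For anyone who wants to confirm which parser their own environment ends up with, here is a minimal sketch (not part of Tika; the class and method names are made up for illustration). It assumes the standard JAXP lookup, where a standalone xercesImpl jar on the classpath typically takes precedence over the copy bundled in the JDK.

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper, not a Tika class: reports which SAX parser JAXP resolves to.
public class SaxParserCheck {

    // Package prefix of the Xerces copy bundled inside the JDK.
    static final String JDK_INTERNAL_PREFIX = "com.sun.org.apache.xerces.internal.";
    // Package prefix of the standalone Xerces2 (xercesImpl) artifact.
    static final String XERCES2_PREFIX = "org.apache.xerces.";

    // Classify a parser implementation by its fully qualified class name.
    static String describe(String className) {
        if (className.startsWith(JDK_INTERNAL_PREFIX)) {
            return "JDK built-in parser";
        }
        if (className.startsWith(XERCES2_PREFIX)) {
            return "standalone Xerces2 (xercesImpl)";
        }
        return "other implementation";
    }

    // Ask JAXP for a parser and return the concrete class it picked.
    static String currentSaxParserClass() throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        return parser.getClass().getName();
    }

    public static void main(String[] args) throws Exception {
        String name = currentSaxParserClass();
        System.out.println(name + " -> " + describe(name));
    }
}
```

Running `mvn dependency:tree -Dincludes=xerces:xercesImpl` in the affected project should show whether, and through which dependency, the standalone jar is being pulled in.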

Given your findings, it makes sense to me to include xerces2.  Fellow devs, any objections?

> Out of memory issues when extracting large files (pst)
> ------------------------------------------------------
>
>                 Key: TIKA-2802
>                 URL: https://issues.apache.org/jira/browse/TIKA-2802
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20, 1.19.1
>         Environment: Reproduced on Windows 2012 R2 and Ubuntu 18.04.
> Java: jdk1.8.0_151
>  
>            Reporter: Caleb Ott
>            Priority: Critical
>         Attachments: Selection_111.png, Selection_117.png
>
>
> I have an application that extracts text from multiple files on a file share. I've been
> running into issues with the application running out of memory (~26g dedicated to the heap).
> I found in the heap dumps there is a "fDTDDecl" buffer which is creating very large char
> arrays and never releasing that memory. In the picture you can see the heap dump with 4 SAXParsers
> holding onto a large chunk of memory. The fourth one is expanded to show it is all being held
> by the "fDTDDecl" field. This dump is from a scaled down execution (not a 26g heap).
> It looks like that DTD field should never be that large, I'm wondering if this is a bug
> with xerces instead? I can easily reproduce the issue by attempting to extract text from large
> .pst files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
