From: "Hudson (JIRA)"
To: dev@tika.apache.org
Reply-To: dev@tika.apache.org
Date: Thu, 3 Jan 2019 20:17:00 +0000 (UTC)
Subject: [jira] [Commented] (TIKA-2802) Out of memory issues when extracting large files (pst)

    [ https://issues.apache.org/jira/browse/TIKA-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733485#comment-16733485 ]

Hudson commented on TIKA-2802:
------------------------------

FAILURE: Integrated in Jenkins build tika-2.x-windows #369 (See [https://builds.apache.org/job/tika-2.x-windows/369/])
TIKA-2802 -- try to clear the XMLReader's resources to avoid OOM (tallison: rev a0688825b15b8d3f1672236b0f1f6536c8a863c4)
* (edit) tika-core/src/main/java/org/apache/tika/utils/XMLReaderUtils.java

> Out of memory issues when extracting large files (pst)
> ------------------------------------------------------
>
>                 Key: TIKA-2802
>                 URL: https://issues.apache.org/jira/browse/TIKA-2802
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20, 1.19.1
>         Environment: Reproduced on Windows 2012 R2 and Ubuntu 18.04.
> Java: jdk1.8.0_151
>
>            Reporter: Caleb Ott
>            Priority: Critical
>         Attachments: Selection_111.png
>
>
> I have an application that extracts text from multiple files on a file share. I've been running into issues with the application running out of memory (~26g dedicated to the heap).
> I found in the heap dumps there is a "fDTDDecl" buffer which is creating very large char arrays and never releasing that memory. In the picture you can see the heap dump with 4 SAXParsers holding onto a large chunk of memory. The fourth one is expanded to show it is all being held by the "fDTDDecl" field. This dump is from a scaled down execution (not a 26g heap).
> It looks like that DTD field should never be that large, I'm wondering if this is a bug with xerces instead? I can easily reproduce the issue by attempting to extract text from large .pst files.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)