From: "Michael McCandless (JIRA)"
To: dev@tika.apache.org
Date: Mon, 4 Feb 2013 14:54:14 +0000 (UTC)
Subject: [jira] [Commented] (TIKA-1072) AIOOBE when handling embedded document in .doc file

    [ https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570308#comment-13570308 ]

Michael McCandless commented on TIKA-1072:
------------------------------------------

OK I opened TIKA-1074; this issue will explore whether this document is corrupt or not ...

> AIOOBE when handling embedded document in .doc file
> ---------------------------------------------------
>
>                 Key: TIKA-1072
>                 URL: https://issues.apache.org/jira/browse/TIKA-1072
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Michael McCandless
>             Fix For: 1.4
>
>         Attachments: 20-Force-on-a-current-S00.doc
>
>
> I have a Word (.doc) document that hits an exception when I run:
> {noformat}
> java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar /x/tmp/20-Force-on-a-current-S00.doc
> {noformat}
> Here's the exception:
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
> 	at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
> 	at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
> 	at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
> 	at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
> 	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
> 	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
> 	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> {noformat}
> It happens when we try to parse an OLE10 embedded object ... the code
> that does this parsing captures and ignores Ole10NativeException and
> skips the entry ... so I'm wondering if we should also catch AIOOBE
> and skip the entry?  Ie, maybe this entry really is not OLE10, and the
> Ole10Native code is failing to throw Ole10NativeException for it?
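
The question raised in the description, whether an ArrayIndexOutOfBoundsException should be handled like Ole10NativeException so that the entry is skipped, can be sketched as follows. This is a minimal, self-contained illustration and not the Tika or POI code: NotOle10Exception, readShortAt, parseEntry and parseAll are hypothetical stand-ins, and the fixed offset of 40 is an assumption chosen only to echo the index in the stack trace.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a checked "not an OLE10 entry" exception; not the POI class. */
class NotOle10Exception extends Exception {
    NotOle10Exception(String message) {
        super(message);
    }
}

class EmbeddedEntrySketch {

    /** Reads a little-endian unsigned short; throws ArrayIndexOutOfBoundsException if data is too short. */
    static int readShortAt(byte[] data, int offset) {
        return (data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8);
    }

    /**
     * Hypothetical entry parser. It reports an empty entry with the checked
     * exception, but does no length check before reading at offset 40, so a
     * truncated entry surfaces as an unchecked ArrayIndexOutOfBoundsException.
     */
    static String parseEntry(byte[] data) throws NotOle10Exception {
        if (data.length == 0) {
            throw new NotOle10Exception("empty entry");
        }
        int flags = readShortAt(data, 40); // assumed offset, for illustration only
        return "flags=" + flags;
    }

    /** Parses every entry and skips the ones that fail with either exception type. */
    static List<String> parseAll(List<byte[]> entries) {
        List<String> parsed = new ArrayList<>();
        for (byte[] entry : entries) {
            try {
                parsed.add(parseEntry(entry));
            } catch (NotOle10Exception | ArrayIndexOutOfBoundsException e) {
                // Treat both failures as "this entry is not parseable" and move on.
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        byte[] wellFormed = new byte[42];
        wellFormed[40] = 1;
        wellFormed[41] = 2;

        List<byte[]> entries = new ArrayList<>();
        entries.add(wellFormed);   // parses: 1 | (2 << 8) = 513
        entries.add(new byte[40]); // too short: index 40 is out of bounds
        entries.add(new byte[0]);  // rejected with the checked exception

        System.out.println(parseAll(entries)); // prints [flags=513]
    }
}
```

Catching the unchecked exception at the call site keeps the rest of the document parseable, at the cost of hiding a possible bug in the parser. The alternative hinted at in the description is to have the parser validate lengths itself and throw its checked exception.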