tika-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-1072) AIOOBE when handling embedded document in .doc file
Date Mon, 04 Feb 2013 15:24:12 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-1072:
-------------------------------------

    Attachment: Ole10NativeEntry.bin

I'm attaching the 40-byte \u0001Ole10Native entry; here's the hex dump:

00000000  24 00 00 00 02 00 01 01  00 0a 01 12 83 46 02 86  |$............F..|
00000010  3d 12 83 49 12 83 6c 12  83 42 12 82 73 12 82 69  |=..I..l..B..s..i|
00000020  12 82 6e 02 84 71 00 00                           |..n..q..|
00000028
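
For anyone who wants to poke at the bytes without opening the attachment, here is a minimal sketch that loads the dump above into an array and decodes the leading field. It assumes the first four bytes are a little-endian length prefix for the rest of the entry; that reading, the class name `Ole10HeaderSketch`, and its helpers are illustrative assumptions, not POI's API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch only: hypothetical helper, not part of POI or Tika.
class Ole10HeaderSketch {
    // The 40 bytes from the hex dump above, 8 bytes per line.
    static final byte[] ENTRY = {
        0x24, 0x00, 0x00, 0x00, 0x02, 0x00, 0x01, 0x01,
        0x00, 0x0a, 0x01, 0x12, (byte) 0x83, 0x46, 0x02, (byte) 0x86,
        0x3d, 0x12, (byte) 0x83, 0x49, 0x12, (byte) 0x83, 0x6c, 0x12,
        (byte) 0x83, 0x42, 0x12, (byte) 0x82, 0x73, 0x12, (byte) 0x82, 0x69,
        0x12, (byte) 0x82, 0x6e, 0x02, (byte) 0x84, 0x71, 0x00, 0x00
    };

    // Assumption: bytes 0..3 hold a little-endian 32-bit payload length.
    static int declaredSize(byte[] entry) {
        if (entry.length < 4) {
            throw new IllegalArgumentException("entry shorter than 4 bytes");
        }
        return ByteBuffer.wrap(entry).order(ByteOrder.LITTLE_ENDIAN).getInt(0);
    }

    // True when the declared length plus the 4-byte prefix covers the whole entry.
    static boolean sizeMatches(byte[] entry) {
        return declaredSize(entry) == entry.length - 4;
    }

    public static void main(String[] args) {
        System.out.println("declared size: " + declaredSize(ENTRY)); // 36
        System.out.println("entry length:  " + ENTRY.length);        // 40
        System.out.println("size matches:  " + sizeMatches(ENTRY));  // true
    }
}
```

Under that assumption the prefix (0x24 = 36) plus its own 4 bytes accounts for all 40 bytes, so the entry does not look truncated; the overrun would then come from how the remaining 36 bytes are interpreted.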

                
> AIOOBE when handling embedded document in .doc file
> ---------------------------------------------------
>
>                 Key: TIKA-1072
>                 URL: https://issues.apache.org/jira/browse/TIKA-1072
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Michael McCandless
>             Fix For: 1.4
>
>         Attachments: 20-Force-on-a-current-S00.doc, Ole10NativeEntry.bin
>
>
> I have a Word (.doc) document that hits an exception when I run:
> {noformat}
> java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar /x/tmp/20-Force-on-a-current-S00.doc
> {noformat}
> Here's the exception:
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
> 	at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
> 	at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
> 	at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
> 	at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
> 	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
> 	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
> 	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> {noformat}
> It happens when we try to parse an OLE10 embedded object. The code
> that does this parsing catches and ignores Ole10NativeException and
> skips the entry, so I'm wondering if we should also catch AIOOBE
> (ArrayIndexOutOfBoundsException) and skip the entry? I.e., maybe this
> entry really is not OLE10, and the Ole10Native code is failing to
> throw Ole10NativeException for it?
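
To make the two options in the description concrete, here is a rough, self-contained sketch. Everything in it is hypothetical: `MalformedEntryException` stands in for a checked exception like Ole10NativeException, and `parse` is a toy parser, not POI's code. The parser converts an out-of-bounds read into the checked exception, so the caller's existing "catch and skip" path also covers truncated or non-OLE10 entries.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: hypothetical names, not the POI or Tika implementation.
class SkipMalformedEntrySketch {
    /** Hypothetical stand-in for a checked "this entry is not valid" exception. */
    static class MalformedEntryException extends Exception {
        MalformedEntryException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Unchecked little-endian 16-bit read; running past the array throws AIOOBE. */
    static short getShortLE(byte[] data, int offset) {
        int b0 = data[offset] & 0xFF;
        int b1 = data[offset + 1] & 0xFF;
        return (short) ((b1 << 8) | b0);
    }

    /** Toy parser: skips a 4-byte prefix, then reads fieldCount 16-bit fields. */
    static List<Short> parse(byte[] data, int fieldCount) throws MalformedEntryException {
        try {
            List<Short> fields = new ArrayList<>();
            for (int i = 0; i < fieldCount; i++) {
                fields.add(getShortLE(data, 4 + 2 * i));
            }
            return fields;
        } catch (ArrayIndexOutOfBoundsException e) {
            // Report the overrun as a malformed entry instead of leaking AIOOBE.
            throw new MalformedEntryException("entry too short for " + fieldCount + " fields", e);
        }
    }

    /** Caller side: parse every entry, skipping the ones that are malformed. */
    static int countParsable(List<byte[]> entries, int fieldCount) {
        int parsed = 0;
        for (byte[] entry : entries) {
            try {
                parse(entry, fieldCount);
                parsed++;
            } catch (MalformedEntryException e) {
                // Skip this entry and keep going with the rest of the document.
            }
        }
        return parsed;
    }
}
```

Wrapping inside the parser keeps the caller unchanged; catching AIOOBE directly in the caller would be the smaller patch, at the cost of possibly hiding unrelated indexing bugs.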

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
