tika-dev mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: svn commit: r1165230 - in /tika/trunk/tika-parsers/src: main/java/org/apache/tika/parser/microsoft/ooxml/ test/java/org/apache/tika/parser/microsoft/ test/resources/test-documents/
Date Mon, 05 Sep 2011 12:23:07 GMT

On Mon, Sep 5, 2011 at 12:30 PM,  <maxcom@apache.org> wrote:
> Embedded file extraction is broken for some OOXML files
> (bug introduced few commits ago)

That was me in revision 1164578 for TIKA-704. :-(

> -            if (root.hasEntry("CONTENTS")) {
> -                stream = TikaInputStream.get(
> -                        fs.createDocumentInputStream("CONTENTS"));

This was my attempt at properly handling the embedded PDF in
TestWithPdf.docx. The PDF is included in an OLE object, with the PDF
document as its "CONTENTS" entry. I restored this functionality with
some more specific checks in revision 1165259, and the resulting code
should now work correctly with all the test documents we have.
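
For illustration only, here is a minimal sketch of what a "more specific
check" on a "CONTENTS" entry could look like. It is not the code from
revision 1165259, and it deliberately avoids the POI API: the class
ContentsEntryCheck and both of its methods are hypothetical, the entry
names are modelled as a plain Set<String>, and the payload as a byte
array. The only assumption about the data is that a PDF starts with the
standard "%PDF-" header.

```java
import java.nio.charset.StandardCharsets;
import java.util.Set;

// Hypothetical helper, not part of Tika or POI.
final class ContentsEntryCheck {
    // Every PDF file begins with the ASCII bytes "%PDF-".
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private ContentsEntryCheck() {
    }

    // Returns true if the data starts with the PDF header bytes.
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Treats the "CONTENTS" entry as an embedded PDF only when the entry
    // exists AND its payload looks like a PDF, rather than trusting the
    // entry name alone.
    static boolean isEmbeddedPdf(Set<String> entryNames, byte[] contents) {
        return entryNames.contains("CONTENTS") && looksLikePdf(contents);
    }
}
```

The point of the sketch is the second condition: the presence of a
"CONTENTS" entry on its own does not decide how the stream is handled.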

Improvements welcome, as I'm no expert on POI or the Office file format.


Jukka Zitting
