tika-dev mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: svn commit: r1165230 - in /tika/trunk/tika-parsers/src: main/java/org/apache/tika/parser/microsoft/ooxml/ test/java/org/apache/tika/parser/microsoft/ test/resources/test-documents/
Date Mon, 05 Sep 2011 17:02:55 GMT

2011/9/5 Maxim Valyanskiy <maxcom@jet.msk.su>:
> On 05.09.2011, at 16:23, Jukka Zitting wrote:
>> This was my attempt at properly handling the embedded PDF in
>> TestWithPdf.docx. It was included in an OLE object with the PDF
>> document as its "CONTENTS" entry. I restored this functionality with
>> some more specific checks in revision 1165259, and the resulting code
>> should now work correctly with all the test documents we have.
> Hm, that is strange - the current version of OfficeParser.POIFSDocumentType.detectType()
> treats a "CONTENTS" entry as identifying the POI filesystem as an MS Works document.
> Maybe this is not right.

I think we have some MS Works test files that do contain the
"CONTENTS" entry, though I'm not sure if that's the best possible
heuristic for detecting MS Works documents. My fix in revision 1165259
also checks for the presence of explicit OLE entries, which I believe
should help prevent collisions with actual embedded MS Works documents.
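
As a rough illustration of the kind of check described above (a sketch
only, not the actual code from revision 1165259), the heuristic could
look something like the following. The class name, the enum, and the
choice of "\u0001Ole" and "\u0001CompObj" as the "explicit OLE entries"
are assumptions made for this example, and the entry names are passed
in as a plain set rather than read from a POI filesystem.

```java
import java.util.Set;

// Hypothetical sketch: classify a POIFS container by its top-level entry names.
public class EntryHeuristicSketch {

    enum Kind { EMBEDDED_OLE_CONTENTS, MS_WORKS, UNKNOWN }

    // Assumed names for the explicit OLE marker entries.
    static final String OLE_ENTRY = "\u0001Ole";
    static final String COMP_OBJ_ENTRY = "\u0001CompObj";
    static final String CONTENTS_ENTRY = "CONTENTS";

    static Kind classify(Set<String> entryNames) {
        boolean hasContents = entryNames.contains(CONTENTS_ENTRY);
        boolean hasOleMarkers = entryNames.contains(OLE_ENTRY)
                && entryNames.contains(COMP_OBJ_ENTRY);

        if (hasContents && hasOleMarkers) {
            // OLE wrapper whose payload (e.g. a PDF) sits in "CONTENTS".
            return Kind.EMBEDDED_OLE_CONTENTS;
        }
        if (hasContents) {
            // Fall back to the existing "CONTENTS means MS Works" heuristic.
            return Kind.MS_WORKS;
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify(Set.of("CONTENTS", "\u0001Ole", "\u0001CompObj")));
        System.out.println(classify(Set.of("CONTENTS")));
        System.out.println(classify(Set.of("WordDocument")));
    }
}
```

The more specific case is tested first, so a bare "CONTENTS" entry
still falls through to the MS Works guess.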

> Please add unit test with that TestWithPdf.docx.

The file was uploaded without the "grant license" option (and I
couldn't create a similar document myself) so I unfortunately couldn't
add the test case along with my original commit. I asked for the
required license grant in TIKA-704 and will add the test case if it is granted.


Jukka Zitting
