tika-dev mailing list archives

From Nick Burch <nick.bu...@alfresco.com>
Subject Re: svn commit: r1165230 - in /tika/trunk/tika-parsers/src: main/java/org/apache/tika/parser/microsoft/ooxml/ test/java/org/apache/tika/parser/microsoft/ test/resources/test-documents/
Date Mon, 05 Sep 2011 17:06:05 GMT
On Mon, 5 Sep 2011, Jukka Zitting wrote:
>> Hm, that is strange - current version of 
>> OfficeParser.POIFSDocumentType.detectType() thinks that "CONTENTS" part 
>> identifies POI filesystem as MS Works document. Maybe this is not 
>> right.
>
> I think we have some MS Works test files that do contain the
> "CONTENTS" entry, though I'm not sure if that's the best possible
> heuristic for detecting MS Works documents.

I've checked a few sample files, and they contain both the CONTENTS and 
SPELLING entries, so I tweaked the rule to look for both.
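
As a rough sketch of what the tweaked rule amounts to (this is not the actual OfficeParser code; the class and method names are made up for illustration, and it works on a plain set of top-level entry names rather than a real POIFS directory):

```java
import java.util.Set;

// Hypothetical illustration only, not Tika's OfficeParser implementation.
class WorksHeuristicSketch {

    // Entry names as mentioned in this thread.
    static final String CONTENTS = "CONTENTS";
    static final String SPELLING = "SPELLING";

    // Earlier rule: a CONTENTS entry alone was taken to mean MS Works.
    static boolean looksLikeWorksOldRule(Set<String> topLevelEntryNames) {
        return topLevelEntryNames.contains(CONTENTS);
    }

    // Tweaked rule: require both CONTENTS and SPELLING to be present.
    static boolean looksLikeWorks(Set<String> topLevelEntryNames) {
        return topLevelEntryNames.contains(CONTENTS)
                && topLevelEntryNames.contains(SPELLING);
    }
}
```

With this, a container that only has a CONTENTS entry no longer matches, while one that has both entries still does.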

> My fix in revision 1165259 also checks for the presence of explicit OLE 
> entries, which I believe should help prevent collisions with actual 
> embedded MS Works documents.

Might we want separate types for OLE1 native and general OLE2? Currently 
the detector won't let us spot the difference between them.
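
Something along these lines is what I have in mind, as a sketch only. The enum and its constants are hypothetical, and the entry names are an assumption based on the usual OLE stream naming (a "\u0001Ole10Native" stream for OLE1 native data, "\u0001Ole" or "\u0001CompObj" for general OLE2 objects); they are not taken from the committed code:

```java
import java.util.Set;

// Hypothetical illustration of separate types, not the detector's real code.
class OleKindSketch {

    enum Kind { OLE10_NATIVE, GENERAL_OLE2, UNKNOWN }

    // Assumed entry names, following conventional OLE stream naming.
    static final String OLE10_NATIVE_ENTRY = "\u0001Ole10Native";
    static final String OLE_ENTRY = "\u0001Ole";
    static final String COMP_OBJ_ENTRY = "\u0001CompObj";

    // Classify a container from the names of its top-level entries.
    static Kind classify(Set<String> entryNames) {
        // Check the more specific OLE1 native marker first.
        if (entryNames.contains(OLE10_NATIVE_ENTRY)) {
            return Kind.OLE10_NATIVE;
        }
        if (entryNames.contains(OLE_ENTRY) || entryNames.contains(COMP_OBJ_ENTRY)) {
            return Kind.GENERAL_OLE2;
        }
        return Kind.UNKNOWN;
    }
}
```

The OLE1 native check comes first so that a container carrying both kinds of marker is reported as the more specific type.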

Nick
