tika-dev mailing list archives

From Nick Burch <nick.bu...@alfresco.com>
Subject Re: Detecting container formats
Date Wed, 16 Jun 2010 11:01:48 GMT
On Tue, 15 Jun 2010, Alex Ott wrote:
> Hmmm, the WordDocument stream in a .doc can only be under the / directory 
> entry, but yes - it could be anywhere in the list of OLE2 entries...

And the list of OLE2 entries can come anywhere in the file - the header 
block contains a pointer to the block holding the entries, which is 
normally near the start but isn't required to be...
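
To make the header pointer concrete, here is a rough sketch of reading it 
by hand. It assumes the usual OLE2 (Compound File) header layout: 8-byte 
signature at offset 0, sector shift at offset 30, first directory sector 
at offset 48, all little-endian. The class and method names are made up 
for illustration and are not POI or Tika API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative only: finds where the OLE2 directory entries start. */
final class Ole2Header {
    // 8-byte OLE2 signature expected at offset 0 of the file.
    private static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** True if the first 512 bytes start with the OLE2 signature. */
    static boolean isOle2(byte[] header) {
        if (header.length < 512) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (header[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Byte offset of the first directory sector, or -1 if not recognised. */
    static long directoryOffset(byte[] header) {
        if (!isOle2(header)) {
            return -1;
        }
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        int sectorShift = buf.getShort(30);   // 9 => 512-byte sectors, 12 => 4096
        if (sectorShift != 9 && sectorShift != 12) {
            return -1;
        }
        long firstDirSector = buf.getInt(48) & 0xFFFFFFFFL;
        // Sector 0 begins one sector-sized slot into the file (after the header).
        return (firstDirSector + 1) << sectorShift;
    }
}
```

Nothing stops the directory sector number from being large, so the offset 
this returns can be well beyond the first few KB of the file.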

Detecting OLE2 or Zip with magic seems easy enough, but as mentioned, it's 
what's inside them that matters, and I don't think magic + a few regexps 
on the first few KB will cut it for that :/
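
For the Zip side, a minimal sketch of what "looking inside" could mean: 
check the magic, then walk the entries until a telling name turns up. The 
entry names below are the conventional OOXML part names, used here as a 
heuristic and an assumption on my part; ZipPeek is a hypothetical helper, 
not something in Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Illustrative only: guesses a type from the entry names inside a Zip. */
final class ZipPeek {
    static final String DOCX =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    static final String XLSX =
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

    /** True if the data starts with a Zip local file header (PK\3\4). */
    static boolean hasZipMagic(byte[] data) {
        return data.length >= 4
            && data[0] == 'P' && data[1] == 'K' && data[2] == 3 && data[3] == 4;
    }

    /** Walks the entries; falls back to plain Zip if none is recognised. */
    static String guessType(byte[] data) throws IOException {
        if (!hasZipMagic(data)) {
            return "application/octet-stream";
        }
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.equals("word/document.xml")) {
                    return DOCX;
                }
                if (name.equals("xl/workbook.xml")) {
                    return XLSX;
                }
            }
        }
        return "application/zip";
    }
}
```

The loop may have to read through the whole archive before it hits the 
entry it needs, which is the part that sniffing the first few KB can't do.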

> Maybe it would be useful to make this configurable? Sometimes it's useful 
> to force media type detection by magic only, not by extension (for 
> example, the file could have been renamed)...

IIRC, if you don't set the filename in the Metadata object that you pass 
into the detector, then it can't use the file extension!

Not sure how you could best turn the container detection off though, short 
of a config that would disable the loading of OLE2 and Zip files (and 
maybe other containers in the future), but then what (if anything) would 
we return for the mimetype? Maybe just a generic one?
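
Purely as a sketch of the "generic one" idea, assuming a boolean switch 
and a pluggable deep-inspection step: with the switch off, only the 
container type is reported. Everything here is hypothetical 
(ContainerAwareDetector, the inspector hook, and the choice of 
application/x-ole-storage as the generic OLE2 label); it is not how Tika 
does or will do it.

```java
import java.util.function.Function;

/** Illustrative only: container detection with an on/off switch. */
final class ContainerAwareDetector {
    static final String GENERIC_ZIP = "application/zip";
    static final String GENERIC_OLE2 = "application/x-ole-storage"; // placeholder label
    static final String UNKNOWN = "application/octet-stream";

    private final boolean inspectContainers;
    // Hypothetical hook that opens the container; may return null if unsure.
    private final Function<byte[], String> deepInspector;

    ContainerAwareDetector(boolean inspectContainers,
                           Function<byte[], String> deepInspector) {
        this.inspectContainers = inspectContainers;
        this.deepInspector = deepInspector;
    }

    String detect(byte[] data) {
        String generic = sniff(data);
        if (generic.equals(UNKNOWN) || !inspectContainers) {
            return generic; // magic-only answer
        }
        String specific = deepInspector.apply(data);
        return specific != null ? specific : generic;
    }

    private static String sniff(byte[] d) {
        if (startsWith(d, 'P', 'K', 3, 4)) {
            return GENERIC_ZIP;
        }
        if (startsWith(d, 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1)) {
            return GENERIC_OLE2;
        }
        return UNKNOWN;
    }

    private static boolean startsWith(byte[] d, int... magic) {
        if (d.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((d[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With the flag off, a renamed .docx would come back as application/zip, 
which is at least honest about what the magic alone can tell you.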

