tika-dev mailing list archives

From Alex Ott <alex...@gmail.com>
Subject Re: Detecting container formats
Date Wed, 16 Jun 2010 11:16:38 GMT

Nick Burch  at "Wed, 16 Jun 2010 12:01:48 +0100 (BST)" wrote:
 NB> On Tue, 15 Jun 2010, Alex Ott wrote:
 >> Hmmm, the WordDocument stream in a .doc can only be under the / directory entry,
 >> but yes - it could be anywhere in the list of OLE2 entries...

 NB> And the list of ole2 entries can come anywhere in the file - the header block contains
 NB> a pointer to the block holding the entries, which is normally near the start but isn't
 NB> required to be...

 NB> Detecting OLE2 or Zip with magic seems easy enough, but as mentioned it's what's inside
 NB> them where I don't think magic + a few regexps on the first few KB will cut it :/

Yep, for OLE2 we need to get the whole file and generate the list of entries
in it.  For Zip, we also need the whole file, although reading the list of
entries could be enough.  Sometimes, though, we need to read some files from
the archive to get the correct mime type (odf, {doc,ppt,xls}x, ...).
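
To make the Zip case concrete, here is a rough sketch using only
java.util.zip.  ZipSubtypeSniffer is a made-up name, not an existing Tika
class, and it assumes the caller has already matched the PK magic.  The
OOXML check relies on the conventional part names (word/document.xml and so
on), which is a heuristic; a strict version would read [Content_Types].xml.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration; not part of Tika.
final class ZipSubtypeSniffer {

    private static final String OOXML = "application/vnd.openxmlformats-officedocument.";

    // Assumes the stream is already known to be a Zip (PK magic matched upstream).
    static String detect(InputStream in) throws IOException {
        Set<String> names = new HashSet<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("mimetype")) {
                    // ODF keeps its media type as the content of this entry,
                    // so the entry has to be read, not just listed.
                    return new String(readEntry(zip), StandardCharsets.US_ASCII).trim();
                }
                names.add(entry.getName());
            }
        }
        // Heuristic (assumption): conventional OOXML part names.
        if (names.contains("[Content_Types].xml")) {
            if (names.contains("word/document.xml")) {
                return OOXML + "wordprocessingml.document";
            }
            if (names.contains("xl/workbook.xml")) {
                return OOXML + "spreadsheetml.sheet";
            }
            if (names.contains("ppt/presentation.xml")) {
                return OOXML + "presentationml.presentation";
            }
        }
        if (names.contains("META-INF/MANIFEST.MF")) {
            return "application/java-archive";
        }
        return "application/zip";
    }

    // Reads the current entry to its end; ZipInputStream returns -1 there.
    private static byte[] readEntry(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

Only the ODF branch reads entry content here; the other branches get by with
the entry names alone.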

I'm not sure how best to implement this in Tika; I need to look into the
sources.  One possibility is to create a hierarchy of container processors,
each of which sets the corresponding container subtype, and this value
will be used in the mime-type description.  Something like:

if (string at 0 = "PK\x03\x04" and subtype == 10)
then mimetype = "application/java-archive"
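
In Java, the idea might look roughly like the sketch below.  All names here
(ContainerProcessor, ZipProcessor, MimeRule) are invented for illustration
and do not exist in Tika, the subtype value 10 is as arbitrary as in the
rule above, and the manifest check is a naive byte scan standing in for real
Zip parsing.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical base class: one processor per container family.
abstract class ContainerProcessor {
    static final int UNKNOWN = 0;
    private final byte[] magic;

    ContainerProcessor(byte[] magic) {
        this.magic = magic;
    }

    // True if the data starts with this container's magic bytes.
    boolean matchesMagic(byte[] data) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Looks inside the container and reports which subtype it found.
    abstract int subtype(byte[] data);
}

final class ZipProcessor extends ContainerProcessor {
    static final int JAR = 10; // arbitrary value, as in the rule above
    private static final byte[] MANIFEST =
            "META-INF/MANIFEST.MF".getBytes(StandardCharsets.US_ASCII);

    ZipProcessor() {
        super(new byte[] {'P', 'K', 3, 4});
    }

    @Override
    int subtype(byte[] data) {
        // Placeholder: scans raw bytes for the entry name instead of parsing the Zip.
        return indexOf(data, MANIFEST) >= 0 ? JAR : UNKNOWN;
    }

    private static int indexOf(byte[] haystack, byte[] needle) {
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }
}

// One line of the mime-type description: container magic + subtype -> type.
final class MimeRule {
    final ContainerProcessor container;
    final int subtype;
    final String mimeType;

    MimeRule(ContainerProcessor container, int subtype, String mimeType) {
        this.container = container;
        this.subtype = subtype;
        this.mimeType = mimeType;
    }

    // Returns the first rule whose magic and subtype both match, else the fallback.
    static String detect(byte[] data, List<MimeRule> rules, String fallback) {
        for (MimeRule rule : rules) {
            if (rule.container.matchesMagic(data)
                    && rule.container.subtype(data) == rule.subtype) {
                return rule.mimeType;
            }
        }
        return fallback;
    }
}
```

The rules stay declarative (magic, subtype, type), and only the processors
need to know how to look inside a given container.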

With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/           http://alexott.net
