tika-dev mailing list archives

From Alex Ott <alex...@gmail.com>
Subject Re: Detecting container formats
Date Tue, 15 Jun 2010 18:32:34 GMT

Nick Burch  at "Tue, 15 Jun 2010 18:25:13 +0100 (BST)" wrote:
 NB> Hi All

 NB> I've been thinking about TIKA-391 (intermittent incorrect mime type detection of office
 NB> formats), and I think we might need to do something different for container formats.

 NB> At the moment, for OLE2 based files (.xls, .ppt, .doc, .msg, .vsd etc), and for ZIP
 NB> files (.zip, but also .xlsx, .pptx, .docx, .odf, .odt, .ots, .sxw etc), I don't think
 NB> the current method works well. AFAICT,
 NB> we detect the container, then have sub-class matches that try to look for the appropriate
 NB> children by hoping we can guess where the definition might hide within the
 NB> container. However, I think this is too unreliable - for example, with a .doc file,
 NB> the entry for the Word stream can come anywhere in the list of top level entries, so it is
 NB> hard to find reliably without properly parsing the OLE2 structure.

Hmmm, the WordDocument stream in a .doc can only live under the root (/)
directory entry, but yes, it could be anywhere in the list of OLE2 entries...

 NB> So, I'd like to suggest a slightly different approach, one of loading the container
 NB> to decide the mime type. This will, of course, make the detection step slower and
 NB> more memory hungry for these (but only these) kinds of documents. However, provided
 NB> that we keep the open container around and pass it to the parser in a later step,
 NB> it's work we would've done anyway.

 NB> I'd then see the mime process be something like:
 NB> * Loop over all magic rules
 NB>   * If the magic fits and the file extension fits, pick this one
 NB>   * Otherwise if the magic fits and it's a container:
 NB>     * Load the container
 NB>     * Check the top level entries against our list for that container
 NB>     * If we get a hit, pick that
 NB>     * If nothing hits, assume it's just the container
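
To make the proposed flow concrete, here is a rough sketch for the ZIP case
only, using nothing but java.util.zip. Everything in it is hypothetical and
not Tika API: the ContainerSniffer class, the two lookup tables and the zipOf
helper are made up for illustration. The entry names word/document.xml and
xl/workbook.xml are the conventional main-part locations in OOXML files; the
sketch assumes them rather than resolving the package relationships. A ZIP
has a flat entry list, so it matches full entry names instead of "top level"
ones.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch: class, method and table names are illustrative, not Tika API.
class ContainerSniffer {
    static final String ZIP = "application/zip";
    static final String OOXML = "application/vnd.openxmlformats-officedocument.";

    // File extension -> media type, for the cheap "magic + extension" path.
    static final Map<String, String> BY_EXTENSION = new LinkedHashMap<>();
    // Well-known entry name -> media type, used once the container is loaded.
    static final Map<String, String> BY_ENTRY = new LinkedHashMap<>();
    static {
        BY_EXTENSION.put(".docx", OOXML + "wordprocessingml.document");
        BY_EXTENSION.put(".xlsx", OOXML + "spreadsheetml.sheet");
        BY_EXTENSION.put(".zip", ZIP);
        BY_ENTRY.put("word/document.xml", OOXML + "wordprocessingml.document");
        BY_ENTRY.put("xl/workbook.xml", OOXML + "spreadsheetml.sheet");
    }

    // ZIP local file header magic: 'P' 'K' 0x03 0x04.
    static boolean hasZipMagic(byte[] d) {
        return d.length >= 4 && d[0] == 'P' && d[1] == 'K' && d[2] == 3 && d[3] == 4;
    }

    static String detect(byte[] data, String fileName) {
        if (!hasZipMagic(data)) {
            return "application/octet-stream";
        }
        // Step 1: the magic fits and the extension fits, so pick it without loading.
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> rule : BY_EXTENSION.entrySet()) {
            if (lower.endsWith(rule.getKey())) {
                return rule.getValue();
            }
        }
        // Step 2: load the container and check its entries against the list.
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                String type = BY_ENTRY.get(e.getName());
                if (type != null) {
                    return type;
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        // Step 3: nothing hit, so assume it is just the container.
        return ZIP;
    }

    // Builds a small in-memory ZIP with empty entries, for trying the sketch out.
    static byte[] zipOf(String... names) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }
}
```

With this, ContainerSniffer.detect(ContainerSniffer.zipOf("word/document.xml"), "renamed.bin")
takes the container path, because the extension matches no rule. A real
version would also keep the opened container around for the parser instead
of re-reading the bytes.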

Maybe it would be useful to make this configurable? Sometimes it's useful to
force media type detection by magic only, not by extension (for example,
the file could have been renamed)...
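
A minimal sketch of what such a switch might look like. The DetectionPolicy
class, its Mode enum and the resolve method are all hypothetical names, not
existing Tika API, and the sketch assumes the content-based type and the
extension-based hint have already been computed elsewhere.

```java
import java.util.Optional;

// Hypothetical sketch: the names below are illustrative, not Tika API.
class DetectionPolicy {
    enum Mode {
        TRUST_EXTENSION, // let a matching file extension decide
        MAGIC_ONLY       // ignore the file name, rely on the content only
    }

    // contentType: what magic (or container inspection) found.
    // extensionType: what the file name suggests, if anything.
    static String resolve(Mode mode, String contentType, Optional<String> extensionType) {
        if (mode == Mode.TRUST_EXTENSION && extensionType.isPresent()) {
            return extensionType.get();
        }
        return contentType;
    }
}
```

With MAGIC_ONLY, a renamed file is reported as whatever its content says,
whatever extension it carries.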

With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/        http://alexott.net/
Skype: alex.ott
