Hello
Nick Burch at "Tue, 15 Jun 2010 18:25:13 +0100 (BST)" wrote:
NB> Hi All
NB> I've been thinking about TIKA-391 (intermittent incorrect mime type detection of office
NB> formats), and I think we might need to do something different for container formats.
NB> At the moment, for OLE2 based files (.xls, .ppt, .doc, .msg, .vsd etc), and for ZIP
NB> based files (.zip, but also .xlsx, .pptx, .docx, .odf, .odt, .ots, .sxw etc), I don't
NB> think the current method works well. AFAICT,
NB> we detect the container, then have sub-class matches that try to look for the appropriate
NB> children by hoping we can guess where the definition might hide within the
NB> container. However, I think this is too unreliable - for example, with a .doc file,
NB> the entry for the Word stream can come anywhere in the list of top level entries, so is
NB> very hard to reliably find without properly parsing the OLE2 structure
Hmmm, the WordDocument stream in a .doc file can only live under the / (root)
directory entry, but yes - it can appear anywhere in the list of OLE2 entries...
NB> So, I'd like to suggest a slightly different approach, one of loading the container
NB> format to decide the mime type. This will, of course, make the detection step slower
NB> and more memory hungry for detecting these (but only these) kinds of documents.
NB> However, provided that we keep the open container around and pass it to the parser
NB> in a later step, it's work we would've done anyway.
NB> I'd then see the mime process be something like:
NB> * Loop over all magic rules
NB> * If the magic fits and the file extension fits, pick this one
NB> * Otherwise if the magic fits and it's a container:
NB>   * Load the container
NB>   * Check the top level entries against our list for that container
NB>   * If we get a hit, pick that
NB>   * If nothing hits, assume it's just the container
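To make the "load the container and check its entries" step concrete, here is a
rough sketch for the ZIP case only, written in Python for brevity rather than as
Tika code. The names ZIP_ENTRY_TYPES, detect_zip_container and make_zip are
made up for illustration. The entry names are the conventional OOXML part names
(a stricter check would read [Content_Types].xml), and the "mimetype" entry is
the one ODF packages use to declare their type.

```python
import io
import zipfile

# Hypothetical lookup table: conventional entry name -> media type.
ZIP_ENTRY_TYPES = {
    "word/document.xml":
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/workbook.xml":
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/presentation.xml":
        "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def detect_zip_container(data: bytes) -> str:
    """Decide the type from the archive's entry names, not from byte offsets."""
    if not data.startswith(b"PK\x03\x04"):
        return "application/octet-stream"
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
            # ODF packages state their media type in a "mimetype" entry.
            if "mimetype" in names:
                return zf.read("mimetype").decode("ascii", "replace").strip()
            for entry, media_type in ZIP_ENTRY_TYPES.items():
                if entry in names:
                    return media_type
    except zipfile.BadZipFile:
        return "application/octet-stream"
    # Nothing hit: assume it is just the container.
    return "application/zip"


def make_zip(entries: dict) -> bytes:
    """Build a small in-memory archive so the detector can be tried out."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in entries.items():
            zf.writestr(name, content)
    return buf.getvalue()
```

Because the lookup works on the parsed entry list, the order in which entries
were written into the archive does not matter. The same idea would apply to OLE2,
given a parser that can list the root directory entries.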
Maybe it would be useful to make this configurable? Sometimes it's useful to
force media type detection by magic only, not by extension (for example,
when a file has been renamed)...
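As a rough illustration of what such a switch could look like (a Python sketch,
not Tika code), the flow proposed above can take a use_extension flag. Rule,
RULES, detect and the inspect_container callback are all hypothetical names, and
the rule list is a tiny made-up sample. Returning the magic's own type when a
non-container file has a mismatched extension is my assumption, since the
proposal does not cover that case.

```python
import os
from typing import Callable, NamedTuple, Optional


class Rule(NamedTuple):
    magic: bytes                # leading bytes that must match
    media_type: str             # type to report for this rule
    extensions: frozenset       # extensions that confirm this rule
    container: Optional[str]    # base container type, or None if not a container


# Illustrative rules only, not a real registry.
RULES = [
    Rule(b"PK\x03\x04", "application/vnd.oasis.opendocument.text",
         frozenset({".odt"}), "application/zip"),
    Rule(b"PK\x03\x04", "application/zip", frozenset({".zip"}), "application/zip"),
    Rule(b"%PDF-", "application/pdf", frozenset({".pdf"}), None),
]


def detect(data: bytes, filename: str,
           inspect_container: Callable[[bytes], Optional[str]],
           use_extension: bool = True) -> str:
    ext = os.path.splitext(filename)[1].lower()
    matches = [r for r in RULES if data.startswith(r.magic)]
    if not matches:
        return "application/octet-stream"
    # Shortcut: magic and extension agree (skipped in magic-only mode).
    if use_extension:
        for rule in matches:
            if ext in rule.extensions:
                return rule.media_type
    # Otherwise, for containers, look inside before deciding.
    container = next((r.container for r in matches if r.container), None)
    if container is not None:
        return inspect_container(data) or container
    # Assumption: a non-container falls back to what the magic says.
    return matches[0].media_type
```

With use_extension=False a renamed file is always classified by its content:
the extension shortcut is skipped and the container is inspected every time.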
--
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/ http://alexott.net/
http://alexott-ru.blogspot.com/
Skype: alex.ott