tika-dev mailing list archives

From Nick Burch <nick.bu...@alfresco.com>
Subject Detecting container formats
Date Tue, 15 Jun 2010 17:25:13 GMT
Hi All

I've been thinking about TIKA-391 (intermittent incorrect mime type 
detection of office formats), and I think we might need to do something 
different for container formats.

At the moment, for OLE2 based files (.xls, .ppt, .doc, .msg, .vsd etc), 
and for ZIP based files (.zip, but also .xlsx, .pptx, .docx, .odf, .odt, 
.ots, .sxw etc), I don't think the current method works well. AFAICT, 
we detect the container, then have sub-class matches that look for the 
appropriate children by guessing at which byte offsets their definitions 
might sit within the container. I think this is too unreliable. For 
example, with a .doc file, the entry for the Word stream can come anywhere 
in the list of top level entries, so it is very hard to find reliably 
without properly parsing the OLE2 structure.

So, I'd like to suggest a slightly different approach: actually loading 
the container in order to decide the mime type. This will, of course, make 
the detection step slower and more memory hungry for these (but only 
these) kinds of documents. However, provided that we keep the open 
container around and pass it to the parser in a later step, it's work we 
would've done anyway.

I'd then see the mime detection process being something like:
* Loop over all magic rules
   * If the magic fits and the file extension fits, pick this one
   * Otherwise if the magic fits and it's a container:
     * Load the container
     * Check the top level entries against our list for that container
     * If we get a hit, pick that
     * If nothing hits, assume it's just the container
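
As a rough illustration, the loop above could be sketched in Java as 
follows. This is not existing Tika code: MagicRule, ContainerLoader and 
every method name here are hypothetical, and the container parsing itself 
is left behind an interface. The steps above don't say what happens when 
the magic fits but the rule is neither a container nor an extension match, 
so the sketch assumes we simply move on to the next rule, and fall back to 
application/octet-stream if nothing matches at all.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

class ContainerAwareDetectionSketch {

    /** Hypothetical magic rule: a byte prefix plus what it may contain. */
    static final class MagicRule {
        final String mimeType;                // type if only the magic matches
        final byte[] magic;                   // leading bytes to look for
        final Set<String> extensions;         // lower-case extensions, no dot
        final Map<String, String> childTypes; // top level entry -> specific type

        MagicRule(String mimeType, byte[] magic, Set<String> extensions,
                  Map<String, String> childTypes) {
            this.mimeType = mimeType;
            this.magic = magic;
            this.extensions = extensions;
            this.childTypes = childTypes;
        }

        /** Assumption: a rule with known children describes a container. */
        boolean isContainer() {
            return !childTypes.isEmpty();
        }
    }

    /** Hypothetical stand-in for real OLE2 / ZIP parsing. */
    interface ContainerLoader {
        List<String> topLevelEntries(byte[] data);
    }

    static String detect(byte[] data, String fileName,
                         List<MagicRule> rules, ContainerLoader loader) {
        String ext = extensionOf(fileName);
        for (MagicRule rule : rules) {
            if (!startsWith(data, rule.magic)) {
                continue;
            }
            // Magic fits and the file extension fits: pick this one.
            if (ext != null && rule.extensions.contains(ext)) {
                return rule.mimeType;
            }
            // Magic fits and it's a container: load it, check the entries.
            if (rule.isContainer()) {
                for (String entry : loader.topLevelEntries(data)) {
                    String hit = rule.childTypes.get(entry);
                    if (hit != null) {
                        return hit;
                    }
                }
                // Nothing hit: assume it's just the container.
                return rule.mimeType;
            }
            // Assumption: otherwise keep looking at the remaining rules.
        }
        return "application/octet-stream"; // assumed fallback
    }

    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    static String extensionOf(String fileName) {
        if (fileName == null) {
            return null;
        }
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? null : fileName.substring(dot + 1).toLowerCase();
    }
}
```

In this sketch the container is only loaded when the filename doesn't 
already settle the question, which matches the cost argument above.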

e.g. we have a file with the zip magic, but no (or an unreliable) filename.
  We open the zip file and look at the top level directory entries.
  If we spot [Content_Types].xml and /xl/ we know it's an OOXML Excel file
  If we spot meta.xml and mimetype then read mimetype and go from there
  Else decide it's just a zipfile of files, and handle appropriately
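
The zip case can be sketched with nothing more than java.util.zip. Again 
this is only an illustration under assumptions, not Tika's implementation: 
the class and method names are made up, the stream is read once from start 
to end rather than kept open for the parser, and a leading "/" on entry 
names is stripped because zip entry names normally don't carry one.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class ZipContainerSketch {

    static final String XLSX =
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

    /** Decide a mime type for a stream that already matched the zip magic. */
    static String detect(InputStream in) throws IOException {
        boolean sawContentTypes = false;
        boolean sawXlDir = false;
        boolean sawMetaXml = false;
        String declaredType = null;

        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.startsWith("/")) {
                    name = name.substring(1);
                }
                if (name.equals("[Content_Types].xml")) {
                    sawContentTypes = true;
                } else if (name.startsWith("xl/")) {
                    sawXlDir = true;
                } else if (name.equals("meta.xml")) {
                    sawMetaXml = true;
                } else if (name.equals("mimetype")) {
                    // Read the body of the mimetype entry as the declared type.
                    ByteArrayOutputStream buf = new ByteArrayOutputStream();
                    byte[] chunk = new byte[256];
                    int n;
                    while ((n = zip.read(chunk)) != -1) {
                        buf.write(chunk, 0, n);
                    }
                    declaredType = new String(buf.toByteArray(),
                            StandardCharsets.US_ASCII).trim();
                }
            }
        }

        if (sawContentTypes && sawXlDir) {
            return XLSX; // OOXML Excel file
        }
        if (sawMetaXml && declaredType != null && !declaredType.isEmpty()) {
            return declaredType; // go from what the mimetype entry says
        }
        return "application/zip"; // just a zipfile of files
    }
}
```

A real version would presumably hand the opened container on to the 
parser instead of closing it, so that the work isn't done twice.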

What does everyone else think? Is the extra work in the mime detection 
step (but only for container formats with no reliable filename) worth it 
for the improved detection?

Note - reliably picking the right mime type when we are given a filename
  with a useful extension is an issue that still needs to be solved, but
  it largely wouldn't be affected by this.

