tika-dev mailing list archives

From "Mattmann, Chris A (388J)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: Detecting container formats
Date Tue, 15 Jun 2010 20:25:43 GMT
Hi Ken, and all,

FWIW, Tika can now handle full regexes in glob patterns via the isregex attribute that
I added way back when in TIKA-194 [1].



On 6/15/10 11:56 AM, "Ken Krugler" <kkrugler_lists@transpac.com> wrote:

I think this is a reasonable approach, as long as (per Alex's
suggestion) it's configurable in various ways.

E.g. if you know you don't want to parse OLE2-based files, so you've
removed the jars for those parsers, then it would be great to have an
easy way of disabling the (more expensive) mime-type detection, and
potentially avoiding the dependency on these same jars.

Separately, I think this issue might also trigger improvements to the
existing "magic bytes" detection code in Tika. IIRC, we wound up
adding full regex with some additional matching rules in Krugle, to
extend the (from Nutch, same as Tika) mime-type detection code to
better handle things like source code files. I imagine something
similar might be needed to reliably handle container matching.

-- Ken

On Jun 15, 2010, at 10:25am, Nick Burch wrote:

> Hi All
> I've been thinking about TIKA-391 (intermittent incorrect mime type
> detection of office formats), and I think we might need to do
> something different for container formats.
> At the moment, for OLE2 based files (.xls, .ppt, .doc, .msg, .vsd
> etc), and for ZIP based files (.zip, but also .xlsx, .pptx, .docx,
> .odf, .odt, .ots, .sxw etc), I don't think the current method works
> well. AFAICT, we detect the container, then have sub-class matches
> that try to look for the appropriate children by hoping we can guess
> where the definition might hide within the container. However, I
> think this is too unreliable - for example, with a .doc file, the
> entry for the Word stream can come anywhere in the list of top level
> entries, so it is very hard to find reliably without properly parsing
> the OLE2 structure.
> So, I'd like to suggest a slightly different approach, one of
> loading the container format to decide the mime type. This will, of
> course, make the detection step slower and more memory hungry for
> detecting these (but only these) kinds of documents. However,
> provided that we keep the open container around and pass it to the
> parser in a later step, it's work we would've done anyway.
> I'd then see the mime process be something like:
> * Loop over all magic rules
>  * If the magic fits and the file extension fits, pick this one
>  * Otherwise if the magic fits and it's a container:
>    * Load the container
>    * Check the top level entries against our list for that container
>    * If we get a hit, pick that
>    * If nothing hits, assume it's just the container
> e.g. we have a file with the zip magic, but no / unreliable filename.
> We open the zip file and look at the top level directory entries.
> If we spot [Content_Types].xml and /xl/ we know it's an OOXML Excel
> file.
> If we spot meta.xml and mimetype then read mimetype and go from there.
> ...
> Else decide it's just a zipfile of files, and handle appropriately.
> What does everyone else think? Is the extra work in the mime
> detection step (but only for container formats with no reliable
> filename) worth it for the improved detection?
> note - the issue of reliably picking the right mime type when given
> a filename with a useful extension still needs to be solved, but
> largely wouldn't be affected by this
> Nick
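
The ZIP branch of the flow Nick outlines above can be sketched roughly as follows. This is only an illustration using java.util.zip from the JDK (Java 9+ for readAllBytes), not Tika's actual detector: the class name ZipContainerSniffer, the detect and zipOf methods, and the word/ and ppt/ checks are assumptions added here for the example. Only the [Content_Types].xml plus xl/ rule and the meta.xml plus mimetype rule come from the message itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical sketch: guess a mime type by listing the entries of a ZIP container. */
public class ZipContainerSniffer {

    /** Assumes the caller has already matched the ZIP magic on this stream. */
    public static String detect(InputStream in) throws IOException {
        Set<String> names = new HashSet<>();
        String declaredMimetype = null;
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                names.add(name);
                // ODF-style packages declare their type in an entry called "mimetype".
                if (name.equals("mimetype")) {
                    declaredMimetype =
                        new String(zip.readAllBytes(), StandardCharsets.US_ASCII).trim();
                }
            }
        }
        if (names.contains("[Content_Types].xml")) {
            // Entry names inside a ZIP carry no leading slash, hence "xl/" not "/xl/".
            if (hasPrefix(names, "xl/")) {
                return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
            }
            // The next two checks are an extrapolation, not part of the original proposal.
            if (hasPrefix(names, "word/")) {
                return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
            }
            if (hasPrefix(names, "ppt/")) {
                return "application/vnd.openxmlformats-officedocument.presentationml.presentation";
            }
        }
        if (names.contains("meta.xml") && declaredMimetype != null) {
            return declaredMimetype;
        }
        // Nothing matched: treat it as a plain zip of files.
        return "application/zip";
    }

    private static boolean hasPrefix(Set<String> names, String prefix) {
        for (String name : names) {
            if (name.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /** Test helper: builds an in-memory zip from (entry name, content) pairs. */
    public static byte[] zipOf(String... nameContentPairs) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (int i = 0; i < nameContentPairs.length; i += 2) {
                zip.putNextEntry(new ZipEntry(nameContentPairs[i]));
                zip.write(nameContentPairs[i + 1].getBytes(StandardCharsets.US_ASCII));
                zip.closeEntry();
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] xlsx = zipOf("[Content_Types].xml", "<Types/>", "xl/workbook.xml", "<workbook/>");
        System.out.println(detect(new ByteArrayInputStream(xlsx)));
    }
}
```

Note that this sketch reads the container and then throws it away. The proposal depends on keeping the opened container around and handing it to the parser, so that the extra detection work is not repeated; that part is not shown here.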

Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g

Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
