tika-dev mailing list archives

From "Antoni Mylka (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-560) Improve detection of .mht, Foxmail, and OOXML files
Date Fri, 26 Nov 2010 20:29:14 GMT

    [ https://issues.apache.org/jira/browse/TIKA-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12936090#action_12936090 ]

Antoni Mylka commented on TIKA-560:

When you pass a null stream to MimeTypes, it uses the name to identify the file. I wanted to use
the ContainerAwareDetector as a "better MimeTypes", so the use case for supporting a null stream
there is the same as in MimeTypes: identification by name only. Otherwise I need additional code
that uses ContainerAwareDetector when I have the stream, but calls MimeTypes directly when I only
want name-based identification. Allowing null in ContainerAwareDetector seemed more user-friendly
to me.
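
To illustrate the fallback described above, here is a minimal sketch. It does not use Tika's
actual classes: NullTolerantDetector, its extension table, and the pluggable content detector
are all hypothetical stand-ins, and the real MimeTypes and ContainerAwareDetector signatures
may differ.

```java
import java.io.InputStream;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in, NOT Tika's MimeTypes or ContainerAwareDetector.
final class NullTolerantDetector {
    static final String FALLBACK = "application/octet-stream";

    // Illustrative extension table only; a real detector has many more entries.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "mht", "multipart/related",
            "eml", "message/rfc822");

    // Content-based detection; returns null when it cannot decide.
    private final Function<InputStream, String> contentDetector;

    NullTolerantDetector(Function<InputStream, String> contentDetector) {
        this.contentDetector = contentDetector;
    }

    /** Name-only identification, used when there is no stream. */
    static String detectByName(String name) {
        if (name == null) {
            return FALLBACK;
        }
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return FALLBACK;
        }
        String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, FALLBACK);
    }

    /** A null stream is allowed and simply skips the content-based step. */
    String detect(InputStream stream, String name) {
        if (stream != null) {
            String byContent = contentDetector.apply(stream);
            if (byContent != null) {
                return byContent;
            }
        }
        return detectByName(name);
    }
}
```

With this shape the caller has a single entry point and does not need to choose between two
detectors depending on whether a stream is available.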

As for the lower case, this is a paragraph from RFC 2045:

   The type, subtype, and parameter names are not case sensitive.  For
   example, TEXT, Text, and TeXt are all equivalent top-level media
   types.

So the "proper" way is to *remember* to use equalsIgnoreCase whenever comparing mime types,
but this doesn't work if you have a set and want to determine whether a given mime type is in
it, e.g. a set of "document" mime types, a set of "mime types openable on this machine",
or a set of "mime types editable on this machine". In such a case I need to *remember* to
lower-case the strings when building the set, and then to lower-case each string before checking.
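
The set problem can be sketched as follows. MimeTypeSet is a hypothetical helper, not part of
Tika; it assumes bare "type/subtype" strings without parameters, since parameter values may be
case sensitive and should not be lower-cased blindly.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: a set of mime types with case-insensitive membership.
final class MimeTypeSet {
    private final Set<String> types = new HashSet<>();

    MimeTypeSet(Collection<String> initial) {
        for (String type : initial) {
            add(type);
        }
    }

    /** Lower-case on the way in, so the set only ever holds one spelling. */
    void add(String type) {
        types.add(normalize(type));
    }

    /** Lower-case the probe too; this is the step that is easy to forget. */
    boolean contains(String type) {
        return types.contains(normalize(type));
    }

    // Locale.ROOT avoids locale-specific case mappings (e.g. the Turkish dotless i).
    static String normalize(String type) {
        return type.trim().toLowerCase(Locale.ROOT);
    }
}
```

If every detector already returned lower-case names, a plain HashSet<String> would do and
neither normalization step would be needed.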

In my humble opinion using lower case for mime types is more user-friendly.

> Improve detection of .mht, Foxmail, and OOXML files
> ---------------------------------------------------
>                 Key: TIKA-560
>                 URL: https://issues.apache.org/jira/browse/TIKA-560
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Antoni Mylka
>         Attachments: test-documents.zip, tika-560.patch
> I would like to address the following issues
> 1. Reduce the priority of the text/html magics. With the default priority I have lots
> of .eml, .emlx, mbox and .mht files which contain HTML content but should not be classified
> as XML. The reason is that the HTML magic looks for <html> anywhere between offsets 0 and
> 8192. In Aperture we solved this with an allowsWhiteSpace switch, so that the <html>
> can be preceded by whitespace but not by other content. Since there is no such switch
> in Tika, I suggest reducing the priority of the magic in tika-mimetypes. I attach an .mht
> file from the Aperture test document suite which exhibits the problem.
> 2. Add support for detecting Foxmail files. They come from Foxmail, a mail client popular in
> China; they are roughly the same as mbox, but use a different separator.
> 3. In the case of OOXML files, the container-aware detector computes the mimetype by taking
> the part of [Content_Types.xml], namely:
> <Default Extension="bin" ContentType="application/vnd.ms-excel.sheet.binary.macroEnabled.main"/>
> then it takes the default content type and returns it with the part after the last dot
> removed. There are two issues with this approach:
>  a. some documents use macroEnabled, while other use macroenabled so the case is not
>  b. the "official" mime types contain a '12' suffix at the end, as shown at: http://technet.microsoft.com/en-us/library/ee309278%28office.12%29.aspx.
> I suggest standardizing on lowercase and adding the '12' to the appropriate files.
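
The normalization suggested in point 3 might look roughly like the sketch below. It is not the
code from tika-560.patch: OoxmlTypeNormalizer is a hypothetical name, and the rule that every
type ending in "macroenabled" gets a ".12" suffix is an assumption made for illustration only.

```java
import java.util.Locale;

// Hypothetical helper, not taken from the attached patch.
final class OoxmlTypeNormalizer {

    /**
     * Drops the part after the last dot of a default content type,
     * lower-cases the rest, and appends ".12" where assumed appropriate.
     */
    static String normalize(String contentType) {
        int lastDot = contentType.lastIndexOf('.');
        String base = lastDot < 0 ? contentType : contentType.substring(0, lastDot);

        // Issue a: "macroEnabled" and "macroenabled" must map to the same type.
        base = base.toLowerCase(Locale.ROOT);

        // Issue b (ASSUMPTION): treat every macro-enabled type as one that
        // carries the '12' suffix; the real list of types may be narrower.
        if (base.endsWith(".macroenabled")) {
            base = base + ".12";
        }
        return base;
    }
}
```

For the Default element quoted above, both the "macroEnabled" and the "macroenabled" spelling
then yield application/vnd.ms-excel.sheet.binary.macroenabled.12.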

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
