tika-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
Date Tue, 23 Dec 2014 13:40:13 GMT

     [ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated TIKA-879:
    Attachment: TIKA-879-thunderbird.eml

I've run into the same problem with an .eml file written by Thunderbird (see attachment).

RFC 822 states (http://tools.ietf.org/html/rfc822#section-4.1) that header fields can appear
in any order:
{quote}
Note: Due to an artifact of the notational conventions, the syntax indicates that, when present,
some fields, must be in a particular order.  Header fields are NOT required to occur in any
particular order, except that the message body must occur AFTER the headers.
{quote}
If one of the fields that RFC 822 treats as "optional", especially an "extension-field" ("X-...") or
any "user-defined-field", is the first field in the header, the MIME magic does not match.

Adding {{<sub-class-of type="text/plain"/>}} would solve the problem only partially:
any text file named *.eml would then always be recognized as message/rfc822, regardless of
its content. Is the file name/extension really that strong an indicator?

Or would it be possible to relax the MIME magic and allow additional header fields at the
beginning:
* check for the {{field: value}} structure first
* then check for (some) required fields ("Date:", "From:"), even if they are not immediately at
the beginning
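
The two-step check proposed above could be sketched roughly as follows. This is a standalone illustration in plain Java, not Tika's magic implementation: the class {{RelaxedRfc822Sniffer}}, the method {{looksLikeRfc822}}, and the choice of "Date"/"From" as sufficient fields are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical sketch of a relaxed message/rfc822 check; not part of Tika. */
final class RelaxedRfc822Sniffer {

    // RFC 822 field name: printable ASCII except SPACE and ':', followed by ':'.
    private static final Pattern FIELD =
            Pattern.compile("[\\x21-\\x39\\x3B-\\x7E]+:.*");

    // Assumption: one of these fields anywhere in the header block is enough.
    private static final Set<String> EXPECTED = Set.of("date", "from");

    static boolean looksLikeRfc822(byte[] prefix) {
        String text = new String(prefix, StandardCharsets.US_ASCII);
        String[] lines = text.split("\r?\n", -1);
        boolean sawField = false;
        boolean sawExpected = false;
        // The last element is either empty or a possibly truncated line, so skip it.
        for (int i = 0; i < lines.length - 1; i++) {
            String line = lines[i];
            if (line.isEmpty()) {
                break; // blank line: end of the header block
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                if (!sawField) {
                    return false; // continuation line with nothing to continue
                }
                continue; // folded header value
            }
            if (!FIELD.matcher(line).matches()) {
                return false; // step 1: not a "field: value" line
            }
            sawField = true;
            String name = line.substring(0, line.indexOf(':')).toLowerCase(Locale.ROOT);
            sawExpected |= EXPECTED.contains(name); // step 2: position does not matter
        }
        return sawField && sawExpected;
    }

    public static void main(String[] args) {
        String sample = "X-Mozilla-Status: 0001\r\n"
                + "From: someone@example.org\r\n"
                + "Date: Tue, 23 Dec 2014 13:40:13 +0000\r\n"
                + "\r\n"
                + "body\r\n";
        System.out.println(looksLikeRfc822(sample.getBytes(StandardCharsets.US_ASCII)));
    }
}
```

Such a heuristic would still accept any text file that starts with {{key: value}} lines including a "From:" or "Date:" line, so it trades the current false negatives for some false positives.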

> Detection problem: message/rfc822 file is detected as text/plain.
> -----------------------------------------------------------------
>                 Key: TIKA-879
>                 URL: https://issues.apache.org/jira/browse/TIKA-879
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, mime
>    Affects Versions: 1.0, 1.1, 1.2
>         Environment: linux 3.2.9
> oracle jdk7, openjdk7, sun jdk6
>            Reporter: Konstantin Gribov
>         Attachments: TIKA-879-thunderbird.eml
> When using {{DefaultDetector}}, the mime type detected for {{.eml}} files differs (you can test
> it on {{testRFC822}} and {{testRFC822_base64}} in {{tika-parsers/src/test/resources/test-documents/}}).
> The main reason for this behavior is that only the magic detector really works for such files,
> even if you set {{CONTENT_TYPE}} in the metadata or an {{.eml}} file name in {{RESOURCE_NAME_KEY}}.
> As I found, {{MediaTypeRegistry.isSpecializationOf("message/rfc822", "text/plain")}} returns
> {{false}}, so detection by {{MimeTypes.detect(...)}} works only by magic.
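
The effect described in the issue can be illustrated with a toy model of the refinement rule: a name or content-type hint replaces the magic result only if the hint is a specialization of it. The class {{HintModel}} and its methods are hypothetical and only mimic the idea; the model knows only explicitly declared parent types and is not Tika's code.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of hint refinement; names are hypothetical, not Tika's API. */
final class HintModel {

    // Declared "child type -> parent type" relations (like sub-class-of entries).
    private final Map<String, String> parent = new HashMap<>();

    void addSuperType(String child, String parentType) {
        parent.put(child, parentType);
    }

    // True if 'type' has 'ancestor' somewhere up its declared parent chain.
    boolean isSpecializationOf(String type, String ancestor) {
        String current = parent.get(type);
        // The step limit guards against accidentally declared cycles.
        for (int steps = 0; current != null && steps <= parent.size(); steps++) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = parent.get(current);
        }
        return false;
    }

    // A hint may only narrow the magic result, never contradict it.
    String refine(String magicType, String hint) {
        return (hint != null && isSpecializationOf(hint, magicType)) ? hint : magicType;
    }

    public static void main(String[] args) {
        HintModel model = new HintModel();
        // No relation declared: the .eml hint is ignored.
        System.out.println(model.refine("text/plain", "message/rfc822"));
        // With the relation declared, the hint wins.
        model.addSuperType("message/rfc822", "text/plain");
        System.out.println(model.refine("text/plain", "message/rfc822"));
    }
}
```

In this model, declaring the relation is what lets the file name decide, which is also why every text file named *.eml would then end up as message/rfc822.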

This message was sent by Atlassian JIRA
