tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1602) Detecting standards-non-compliant emails as message/rfc822
Date Wed, 01 Jul 2015 10:52:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609915#comment-14609915 ]

Tim Allison commented on TIKA-1602:
-----------------------------------

+1.

This feels hacky, but we can undo it.  Govdocs1 is limited, and our mileage will vary.  Hopefully,
someone will have the time to work on TIKA-879 soon.

[~jeremybmerrill], I'm sorry for taking so long to get around to running this simple test.
Out of curiosity, what other headers were you getting in that batch of emails?  I'm wondering
if there are more specific rfc822'ish headers that we could rely on, or were you only getting
"Status:"?

> Detecting standards-non-compliant emails as message/rfc822
> ----------------------------------------------------------
>
>                 Key: TIKA-1602
>                 URL: https://issues.apache.org/jira/browse/TIKA-1602
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Jeremy B. Merrill
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: 036491.txt.zip
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Tika does not properly detect certain emails as `message/rfc822` if they're slightly standards-non-compliant and begin with `Status: ` as the first header. I've added `Status: ` as a magic detection line in tika-mimetypes.xml.
> This solves my problem and does not appear to cause unit test failures. I have not yet run the tika-batch tests.
> As further information, the emails that are processed incorrectly come from dumps directly from various US public officials' mailservers. The dumps, I believe since they're not intended to be transmitted over the wire, sometimes are slightly non-compliant.
> It's important to note that Tika (and the underlying library, James Mime4J) do properly *parse* these emails, despite the non-compliant header. The problem is getting Tika to *detect* the file as an email so that Mime4J gets chosen to parse it.
> Pull request on Github at https://github.com/apache/tika/pull/40
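
To illustrate the detect-versus-parse distinction described in the issue, here is a minimal sketch of prefix-based magic detection. It is not Tika's implementation and does not use the Tika API: the class name, the prefix list (apart from `Status:`, which comes from the issue), and the `text/plain` fallback are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;

public class MagicPrefixSketch {
    // Hypothetical magic prefixes. Only "Status:" is taken from the issue;
    // the others are ordinary RFC 822 style headers chosen for illustration.
    static final String[] EMAIL_PREFIXES = {
        "Received:", "From:", "Message-ID:", "Status:"
    };

    // Returns "message/rfc822" if the leading bytes match any prefix at
    // offset 0; otherwise falls back to "text/plain" (an assumed default).
    static String detect(byte[] head) {
        for (String prefix : EMAIL_PREFIXES) {
            byte[] magic = prefix.getBytes(StandardCharsets.US_ASCII);
            if (startsWith(head, magic)) {
                return "message/rfc822";
            }
        }
        return "text/plain";
    }

    // Byte-wise comparison of the start of data against magic.
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] dump = "Status: RO\r\nFrom: a@example.org\r\n\r\nbody"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(dump)); // prints "message/rfc822"
    }
}
```

Without `Status:` in the prefix list, the same dump would fall through to the fallback type and would never be handed to the email parser, even though the parser could handle it. The trade-off is that a short, generic prefix can also match plain-text files that are not emails.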



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
