tika-dev mailing list archives

From "Benjamin Douglas (JIRA)" <j...@apache.org>
Subject [jira] Updated: (TIKA-461) RFC822 messages not parsed
Date Mon, 06 Dec 2010 22:01:16 GMT

     [ https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Benjamin Douglas updated TIKA-461:

    Attachment: TIKA-461-config.patch

Running the current trunk on the Enron email set revealed one weakness of the mime4j default
configuration: by default, no individual header may exceed 1000 characters. A message sent
to a large group of recipients easily exceeds that limit. This latest patch raises the limit
to 10,000 characters, which should be reasonable for most valid emails.
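
To get a rough feel for the numbers, here is a minimal sketch of the arithmetic. It is not mime4j code. The addresses are hypothetical, the two limits are the figures quoted above, and it assumes the limit is checked against the length of the whole unfolded header, which may not match how the parser actually applies it.

```python
# Sketch only: the recipients are made up, and the limits are the two
# figures mentioned in the message (1000 and 10,000 characters).
DEFAULT_LIMIT = 1000
PATCHED_LIMIT = 10_000


def build_to_header(recipient_count):
    """Return an unfolded To: header listing hypothetical recipients."""
    addresses = [f"user{i:04d}@example.com" for i in range(recipient_count)]
    return "To: " + ", ".join(addresses)


def fits(header, limit):
    """Assumed check: the whole header must not exceed the limit."""
    return len(header) <= limit


def max_recipients(limit):
    """Count how many 20-character addresses fit before the limit is hit."""
    count = 0
    while fits(build_to_header(count + 1), limit):
        count += 1
    return count


if __name__ == "__main__":
    for limit in (DEFAULT_LIMIT, PATCHED_LIMIT):
        print(limit, max_recipients(limit))
```

With 20-character addresses, a 1000-character cap is reached after a few dozen recipients, so a company-wide mailing of the kind found in the Enron set goes well past it.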

> RFC822 messages not parsed
> --------------------------
>                 Key: TIKA-461
>                 URL: https://issues.apache.org/jira/browse/TIKA-461
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Joshua Turner
>            Assignee: Julien Nioche
>         Attachments: testRFC822-multipart, TIKA-461-config.patch, TIKA-461-parse.patch,
>                      TIKA-461-plus-tests-1.patch, TIKA-461.patch
>
> Presented with an RFC822 message exported from Thunderbird, AutoDetectParser produces
> an empty body and a Metadata containing only one key-value pair: "Content-Type=message/rfc822".
> Directly calling MboxParser likewise gives an empty body, but with two metadata pairs: "Content-Encoding=us-ascii
> A quick peek at the source of MboxParser shows that the implementation is pretty naive.
> If the wiring can be sorted out, something like Apache James' mime4j might be a better bet.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
