tika-dev mailing list archives

From "Antoni Mylka (JIRA)" <j...@apache.org>
Subject [jira] Updated: (TIKA-561) Support EMLX file detection
Date Thu, 25 Nov 2010 19:41:15 GMT

     [ https://issues.apache.org/jira/browse/TIKA-561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Antoni Mylka updated TIKA-561:

    Attachment: tika-561.patch

A patch that contains the modifications and the test file. It overlaps with my patch to TIKA-560,
but I wanted to make both of them self-contained.

The test email contains a newsletter from CNET. It's public, but I don't know whether ASF policy
would allow committing it. If not, please find someone with Apple Mail and have them create a normal
HTML email.

Note that this works only because the priority of the text/html magics has been reduced, as
explained in TIKA-560.

> Support EMLX file detection
> ---------------------------
>                 Key: TIKA-561
>                 URL: https://issues.apache.org/jira/browse/TIKA-561
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Antoni Mylka
>         Attachments: tika-561.patch
> Apple Mail generates email files in .emlx format. They roughly resemble standard rfc822
> .eml files but are different.
> On the first line they have the content length in bytes,
> then on the second line, normal rfc822 content starts
> and afterwards there is some XML metadata.
> I would suggest adding support for .emlx files to tika-mimetypes.xml. Just copy the message/rfc822
> definitions and state that they should appear at offsets 3:10; this should be enough to accommodate
> the content length on the first line. Any reasonable email should be longer than 9 bytes.
> In this case the first line would have two bytes, then the line break, and normal rfc822 headers
> can start at offset 4. This will work for emails up to 99 MB (99 999 999 bytes).
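
The layout described above (a byte count on the first line, then the rfc822 message, then trailing XML metadata) can be illustrated with a small Python sketch. This is not Tika code and not the content of tika-561.patch; the issue proposes offset-based magic entries in tika-mimetypes.xml instead. The names split_emlx and looks_like_emlx are hypothetical, and tolerating trailing spaces after the byte count is an assumption rather than something stated in the issue.

```python
import re

# Hypothetical helpers for illustration only; not part of Tika.
# Assumption: the first line is a decimal byte count, optionally followed
# by spaces or tabs, and terminated by a line break.
_LENGTH_LINE = re.compile(rb"(\d+)[ \t]*\r?\n")
# An rfc822-style header field name followed by a colon, e.g. "Subject:".
_HEADER_START = re.compile(rb"[A-Za-z][A-Za-z0-9-]*:")


def split_emlx(data: bytes):
    """Split raw .emlx bytes into (rfc822_message, trailing_metadata)."""
    match = _LENGTH_LINE.match(data)
    if match is None:
        raise ValueError("first line is not a byte count")
    length = int(match.group(1))
    start = match.end()          # message begins right after the count line
    end = start + length
    if end > len(data):
        raise ValueError("declared length exceeds file size")
    return data[start:end], data[end:]


def looks_like_emlx(data: bytes) -> bool:
    """Heuristic: a byte-count line followed directly by a header field."""
    match = _LENGTH_LINE.match(data)
    if match is None:
        return False
    return _HEADER_START.match(data, match.end()) is not None
```

Unlike a fixed offset window in a magic definition, this sketch reads the count line to find where the headers begin, so it does not depend on how many digits the byte count has.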

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
