tika-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-461) RFC822 messages not parsed
Date Tue, 28 Sep 2010 11:12:33 GMT

    [ https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915708#action_12915708 ]

Julien Nioche commented on TIKA-461:


Thanks for taking the time to review my patch. 

bq. It'd probably be good to see some more tests with it. For now, just checking your basic
message should be fine, but I'd suggest we also try to get an email with plain text, html,
images and similar in to check the more complex bits.


bq. In terms of the nested parser, I'm tempted to say we do something so that plain text comes
out without any extra work needed. Anything else gets handled via a Parser fetched from the
ParseContext if required, much as we're doing for container formats like zip, .docx etc. That
way, you can throw a simple email at it and get the text, but the rest of the parts are available
if you want them

I hadn't noticed that you've added org.apache.tika.extractor; it seems an elegant way of doing this.
I will have a closer look and see how I can leverage it in RFC822Parser.
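
To make the delegation idea quoted above concrete, here is a minimal, self-contained sketch. It is not taken from the TIKA-461 patch and does not use the Tika API: Part, PartParser, Context and extractText are hypothetical stand-ins, with Context only imitating a lookup of optional collaborators keyed by type. Plain text parts are emitted directly, and any other part is passed to a parser fetched from the context, if the caller supplied one.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DelegatingMailSketch {

    /** Hypothetical stand-in for one body part of a mail message. */
    public static final class Part {
        public final String contentType;
        public final byte[] body;

        public Part(String contentType, byte[] body) {
            this.contentType = contentType;
            this.body = body;
        }
    }

    /** Hypothetical stand-in for a parser that the caller may supply. */
    public interface PartParser {
        String parse(Part part);
    }

    /** Hypothetical stand-in for a context holding optional collaborators by type. */
    public static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        public <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        public <T> T get(Class<T> key) {
            // Returns null when nothing was registered for this type.
            return key.cast(entries.get(key));
        }
    }

    /** Emits plain text directly; hands other parts to the delegate if present. */
    public static String extractText(List<Part> parts, Context context) {
        StringBuilder out = new StringBuilder();
        PartParser delegate = context.get(PartParser.class);
        for (Part part : parts) {
            if (part.contentType.startsWith("text/plain")) {
                // The simple case needs no extra setup from the caller.
                out.append(new String(part.body, StandardCharsets.UTF_8)).append('\n');
            } else if (delegate != null) {
                // Everything else is handled only when a parser was provided.
                out.append(delegate.parse(part)).append('\n');
            }
            // Without a delegate, non-text parts are skipped.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Part> parts = Arrays.asList(
                new Part("text/plain", "Hello".getBytes(StandardCharsets.UTF_8)),
                new Part("image/png", new byte[] {1, 2, 3}));

        Context context = new Context();
        System.out.print(extractText(parts, context));

        // Register a delegate that just reports the type of each part it sees.
        context.set(PartParser.class, p -> "[" + p.contentType + "]");
        System.out.print(extractText(parts, context));
    }
}
```

With an empty context only the plain text part is returned; once a delegate is registered, the remaining parts become available through it, which matches the "simple email in, text out" behaviour described in the quote.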

bq.  Also, the james jars need to be listed in the tika bundle pom so they get properly included

OK, I did not know about that. Thanks.

> RFC822 messages not parsed
> --------------------------
>                 Key: TIKA-461
>                 URL: https://issues.apache.org/jira/browse/TIKA-461
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Joshua Turner
>            Assignee: Julien Nioche
>         Attachments: TIKA-461.patch
> Presented with an RFC822 message exported from Thunderbird, AutodetectParser produces
> an empty body, and a Metadata containing only one key-value pair: "Content-Type=message/rfc822".
> Directly calling MboxParser likewise gives an empty body, but with two metadata pairs: "Content-Encoding=us-ascii
> A quick peek at the source of MboxParser shows that the implementation is pretty naive.
> If the wiring can be sorted out, something like Apache James' mime4j might be a better bet.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
