tika-dev mailing list archives

From "Luis Filipe Nassif (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2478) MBOX import includes redundant copies of the text
Date Tue, 17 Oct 2017 23:11:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208562#comment-16208562 ]

Luis Filipe Nassif commented on TIKA-2478:

Robert, regarding your last suggestion, I think the mbox/rfc822 parsers are correct. The Outlook
parser should not print those fields at the top of the body text, because they are not part of the body.

To achieve a preview similar to an email client's, you can configure a custom ContentHandler
to print those fields to the output before the body.
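
As a rough illustration of that idea, here is a minimal sketch that uses only the JDK's SAX classes. The class name HeaderPrefixHandler and the plain Map of header fields are hypothetical; the Map stands in for whatever metadata object the parser fills in (with Tika, presumably its Metadata), and the sketch assumes the header fields are already populated by the time the first body characters arrive.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Collects body text and writes selected header fields once, before the body. */
class HeaderPrefixHandler extends DefaultHandler {
    // Hypothetical stand-in for the parser's metadata (field name -> value).
    private final Map<String, String> headers;
    private final StringBuilder out = new StringBuilder();
    private boolean headersWritten = false;

    HeaderPrefixHandler(Map<String, String> headers) {
        this.headers = headers;
    }

    // Headers are read lazily, on the first body text, not at construction time.
    private void writeHeadersOnce() {
        if (headersWritten) {
            return;
        }
        headersWritten = true;
        for (Map.Entry<String, String> e : headers.entrySet()) {
            out.append(e.getKey()).append(": ").append(e.getValue()).append('\n');
        }
        out.append('\n'); // blank line between header fields and body
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        writeHeadersOnce();
        out.append(ch, start, length);
    }

    @Override
    public String toString() {
        return out.toString();
    }

    /** Simulates the SAX events a parser would send for a one-line body. */
    static String demo() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("From", "alice@example.org");
        headers.put("Subject", "Hello");
        HeaderPrefixHandler handler = new HeaderPrefixHandler(headers);
        char[] body = "Body text".toCharArray();
        try {
            handler.startDocument();
            handler.characters(body, 0, body.length);
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.toString();
    }
}
```

In a real setup the handler would be passed to the parser in place of a plain text handler, and the Map would be replaced by a lookup of the chosen field names in the parser's metadata.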

> MBOX import includes redundant copies of the text
> -------------------------------------------------
>                 Key: TIKA-2478
>                 URL: https://issues.apache.org/jira/browse/TIKA-2478
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>            Reporter: Robert Letzler
>            Priority: Minor
> MBOX messages often get parsed into four documents:
> a.	The mbox file, the outer container: "/"
> b.	The actual email: "/embedded-1"
> c.	The UTF-8 text content of the email: "/embedded-1/embedded-2"
> d.	The UTF-8 HTML content of the email: "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting.  The MSG parser parses the first non-null
> email body and then skips the rest.  Please modify MBOX to not have separate "attached"
> documents for the HTML body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an example of input
> sufficient to generate this behavior.
> Thanks!

This message was sent by Atlassian JIRA