tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2478) MBOX import includes redundant copies of the text
Date Mon, 23 Oct 2017 12:12:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215024#comment-16215024 ]

Tim Allison commented on TIKA-2478:

[~kkrugler], thank you for these notes.  I think that {{testRFC822-multipart}} shows some
of what you describe.  A {{multipart/mixed}} can contain another part, so we can't simply
use a boolean "inPart"; we have to use a stack to remember which part we're in and to figure
out which part we're exiting.  My current draft patch looks for {{multipart/alternative}},
buffers the contents of each alternative and then, at the end of the {{multipart/alternative}},
looks for html, then rtf, then text.  The first non-null one is the one that gets processed,
and for {{multipart/alternative}}s its content is now inlined, not treated as an embedded document.
Any other part is processed as it was before.  Does this sound about right?
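
To make the idea concrete, here is a rough sketch of the stack-plus-buffer approach described above.  It is not the draft patch and does not use any Tika or mime4j API: the class {{MultipartSketch}}, its methods, and the "inline:"/"embedded:" output strings are all hypothetical, and parts are reduced to plain strings.  It also assumes {{application/rtf}} as the rtf type.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the TIKA-2478 patch and not a Tika API. */
class MultipartSketch {
    // Preference order from the comment above: html, then rtf, then text.
    // The concrete MIME type strings are an assumption of this sketch.
    private static final String[] PREFERRED = {"text/html", "application/rtf", "text/plain"};

    /** One frame per open multipart; a single boolean cannot track nesting. */
    private static final class Frame {
        final String subtype;
        final Map<String, String> buffered = new LinkedHashMap<>();

        Frame(String subtype) {
            this.subtype = subtype;
        }
    }

    private final Deque<Frame> stack = new ArrayDeque<>();
    private final List<String> output = new ArrayList<>();

    /** Called when a multipart opens, e.g. "mixed" or "alternative". */
    void startMultipart(String subtype) {
        stack.push(new Frame(subtype));
    }

    /** Called for each leaf body part. */
    void bodyPart(String mimeType, String content) {
        Frame top = stack.peek();
        if (top != null && "alternative".equals(top.subtype)) {
            // Hold each alternative until the enclosing multipart/alternative closes.
            top.buffered.putIfAbsent(mimeType, content);
        } else {
            // Any other part is handled as before: as an embedded document.
            output.add("embedded:" + mimeType + ":" + content);
        }
    }

    /** Called when the innermost open multipart closes. */
    void endMultipart() {
        Frame closed = stack.pop();
        if (!"alternative".equals(closed.subtype)) {
            return;
        }
        // Inline the first buffered alternative found in preference order.
        for (String type : PREFERRED) {
            String content = closed.buffered.get(type);
            if (content != null) {
                output.add("inline:" + type + ":" + content);
                return;
            }
        }
    }

    List<String> output() {
        return output;
    }
}
```

The sketch leaves open what to do when a {{multipart/alternative}} holds none of the preferred types, or when one of its alternatives is itself a multipart (e.g. {{multipart/related}}); a real patch would have to decide both.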

{{testRFC822-multipart}} doesn't have plain text (i.e. not a {{multipart/alternative}}) before
and after the .gif.  If you'd be willing to share an example or if I've missed one in our
existing unit tests, it would be helpful to have.  Thank you!

> MBOX import includes redundant copies of the text
> -------------------------------------------------
>                 Key: TIKA-2478
>                 URL: https://issues.apache.org/jira/browse/TIKA-2478
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>            Reporter: Robert Letzler
>            Priority: Minor
> MBOX messages often get parsed into four documents:
> a.	The mbox file - outer container "/"
> b.	The actual email--  "/embedded-1"
> c.	The utf-8 text content of the email "/embedded-1/embedded-2"
> d.	The utf-8 html content of the email  "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting.  The MSG parser parses the first non-null
> email body and then skips the rest.  Please modify MBOX to not have separate "attached"
> documents for the html body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an example of input
> sufficient to generate this behavior.
> Thanks!

This message was sent by Atlassian JIRA
