tika-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2478) MBOX import includes redundant copies of the text
Date Mon, 23 Oct 2017 00:29:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214491#comment-16214491 ]

Ken Krugler commented on TIKA-2478:
-----------------------------------

I recently had to dig into extracting text from emails, and it isn't all that straightforward.
E.g. you _do_ want to combine text from {{multipart/mixed}}, but not from {{multipart/alternative}},
where you generally want to pick one part, favoring text over HTML. There's also {{multipart/related}} and {{multipart/signed}}.
And each {{multipart/mixed}} piece has to be evaluated to extract (potential) text; I've typically
seen this with an inlined image, so you get text/html, image/whatever, text/html.
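
To make the rules above concrete, here is a minimal sketch using Python's standard {{email}} package. It is not Tika's implementation or a proposed patch. The function names ({{extract_text}}, {{extract_text_from_raw}}) are made up for illustration, the HTML stripping is a crude regex rather than a real HTML parser, and treating the first child of {{multipart/related}} and {{multipart/signed}} as the content part is a simplifying assumption.

```python
import html
import re
from email import message_from_string
from email.message import Message


def _leaf_text(part: Message) -> str:
    """Return whitespace-normalized text for a text/plain or text/html leaf, else ''."""
    ctype = part.get_content_type()
    if ctype not in ("text/plain", "text/html"):
        return ""  # images and other attachments contribute no body text
    payload = part.get_payload(decode=True)
    if payload is None:
        return ""
    charset = part.get_content_charset() or "utf-8"
    try:
        text = payload.decode(charset, errors="replace")
    except LookupError:  # unknown charset label; fall back to UTF-8
        text = payload.decode("utf-8", errors="replace")
    if ctype == "text/html":
        # Crude tag stripping, for illustration only.
        text = html.unescape(re.sub(r"<[^>]+>", " ", text))
    return " ".join(text.split())


def extract_text(part: Message) -> str:
    """Hypothetical helper: pick one alternative, combine mixed parts."""
    if not part.is_multipart():
        return _leaf_text(part)
    children = part.get_payload()
    ctype = part.get_content_type()
    if ctype == "multipart/alternative":
        # Choose exactly one rendering, trying text/plain before anything else.
        ordered = sorted(children, key=lambda c: c.get_content_type() != "text/plain")
        for child in ordered:
            text = extract_text(child)
            if text:
                return text
        return ""
    if ctype in ("multipart/related", "multipart/signed"):
        # Assumption: the first child is the root document / signed content.
        return extract_text(children[0]) if children else ""
    # multipart/mixed (and unknown multipart types): evaluate every piece.
    pieces = [extract_text(child) for child in children]
    return "\n".join(p for p in pieces if p)


def extract_text_from_raw(raw: str) -> str:
    """Convenience wrapper: parse a raw RFC 822 message string, then extract."""
    return extract_text(message_from_string(raw))
```

With this approach, a {{multipart/mixed}} message laid out as text/html, image/whatever, text/html yields both HTML pieces joined together, while a {{multipart/alternative}} pair of text/plain and text/html yields only the plain-text version.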

> MBOX import includes redundant copies of the text
> -------------------------------------------------
>
>                 Key: TIKA-2478
>                 URL: https://issues.apache.org/jira/browse/TIKA-2478
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>            Reporter: Robert Letzler
>            Priority: Minor
>
> MBOX messages often get parsed into four documents:
> a.	The mbox file, the outer container: "/"
> b.	The actual email: "/embedded-1"
> c.	The UTF-8 text content of the email: "/embedded-1/embedded-2"
> d.	The UTF-8 HTML content of the email: "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting. The MSG parser parses the first non-null
> email body and then skips the rest. Please modify MBOX to not have separate "attached"
> documents for the HTML body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an example of input
> sufficient to generate this behavior.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
