tika-dev mailing list archives

From "Robert Letzler (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2478) MBOX import includes redundant copies of the text
Date Tue, 17 Oct 2017 00:31:04 GMT
Robert Letzler created TIKA-2478:

             Summary: MBOX import includes redundant copies of the text
                 Key: TIKA-2478
                 URL: https://issues.apache.org/jira/browse/TIKA-2478
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.16
            Reporter: Robert Letzler
            Priority: Minor

MBOX messages often get parsed into four documents:
a. The mbox file, the outer container: "/"
b. The actual email: "/embedded-1"
c. The UTF-8 text content of the email: "/embedded-1/embedded-2"
d. The UTF-8 HTML content of the email: "/embedded-1/embedded-3"

Entries c and d are redundant and distracting. The MSG parser parses the first non-null
email body and then skips the rest. Please modify the MBOX parser so that it does not create
separate "attached" documents for the HTML body and the text body.
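
The "first non-null body" behavior requested above could be sketched roughly as follows. This is only an illustration, not Tika code: the class and method names are hypothetical, and the order in which body alternatives are tried (here, list order) is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

// Hypothetical sketch; none of these names come from Tika.
final class BodySelectionSketch {

    /**
     * Returns the first candidate body that is non-null, or an empty
     * Optional if every candidate is null. Later candidates are ignored,
     * so they would never surface as separate embedded documents.
     */
    static Optional<String> firstNonNullBody(List<String> candidates) {
        return candidates.stream()
                .filter(Objects::nonNull)
                .findFirst();
    }

    public static void main(String[] args) {
        // Assumed order of alternatives: a missing body, then HTML, then plain text.
        List<String> bodies = Arrays.asList(null, "<p>Hello</p>", "Hello");
        System.out.println(firstNonNullBody(bodies).orElse("(no body)"));
    }
}
```

With a selection step like this, the email at "/embedded-1" would carry a single body, and the alternatives would not appear as "/embedded-1/embedded-2" and "/embedded-1/embedded-3".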

The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an example of input sufficient
to generate this behavior.


