tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-2921) Tika discarding bodies of inline MIME elements in RFC822 email
Date Tue, 13 Aug 2019 14:54:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906286#comment-16906286 ]

Tim Allison edited comment on TIKA-2921 at 8/13/19 2:53 PM:
------------------------------------------------------------

Oh...ok. The TikaCLI UI uses the BoilerpipeContentHandler under the hood, and that strips
out whatever the BoilerpipeContentHandler thinks is, well, boilerplate.

So, when I run this:
{noformat}
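// Load the custom config attached to the issue.
// getStream(...) is not shown; presumably a test helper that opens a classpath resource.
TikaConfig tikaConfig;
try (InputStream is = getStream("org/apache/tika/parser/mail/tika-2921.xml")) {
    tikaConfig = new TikaConfig(is);
}
// Wrap the XML handler in the BoilerpipeContentHandler, as the UI does.
ContentHandler inner = new ToXMLContentHandler();
ContentHandler handler = new BoilerpipeContentHandler(inner);
try (InputStream tis = getStream("test-documents/TIKA-2921.eml")) {
    new AutoDetectParser(tikaConfig).parse(tis, handler, new Metadata(), new ParseContext());
}
// Print the XML collected by the inner handler.
System.out.println(inner);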
{noformat}

I get the same as you get in your screen cap:
{noformat}
<head><metadata.../><title>Re: website issue?</title></head><body><blockquote /></html>
{noformat}

Note, however, that when you click on "Recursive JSON" in the UI, the text is there because
we don't use the BoilerpipeContentHandler in the RecursiveParserWrapper. :P




> Tika discarding bodies of inline MIME elements in RFC822 email
> --------------------------------------------------------------
>
>                 Key: TIKA-2921
>                 URL: https://issues.apache.org/jira/browse/TIKA-2921
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.22
>         Environment: Reproducible on Java 8 and 11 on both Linux and Win 10.
>            Reporter: Joshua Turner
>            Priority: Major
>         Attachments: tika-2921.xml
>
>
> Given an RFC822 email that has two inline body parts (such as the one attached),
> MailContentHandler's handleInlineBodyPart() method correctly identifies the body part
> that should be emitted as the principal content of the mail item, but then uses
> EmbeddedDocumentUtil.tryToFindExistingLeafParser() to find a parser for that part.
> If no existing leaf parser is found, it simply gives up and treats the given part as
> an attachment.
> IMHO, the correct behaviour would be to create the necessary parser if none is found,
> insert it into the parsing context, and use it to extract the content of the selected
> body part.
> In the meantime, I'm working around the issue by creating and registering a custom
> EmbeddedDocumentExtractor that guesses whether it has been called by the RFC822Parser
> by looking at the "X-Parsed-By" metadata value. When triggered, it looks at the
> Content-Type of the passed-in metadata, and if it's plain text or HTML, it creates a
> new TXTParser or HTMLParser and a new context, and has them parse into the passed-in
> ContentHandler. It works, but it's pretty hacky. It'd be far better to have the change
> in behaviour suggested above.
> [^test.eml]
> ^I've attached the email inline because using the attachment field yields an error:
> "JIRA could not attach the file as there was a missing token. Please try attaching the
> file again." I tried twice with the same error returned.^
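
The decision step of the workaround described in the issue can be sketched roughly as
follows. This is only a stand-alone model, not the reporter's code and not Tika API: it
uses a plain Map in place of Tika's Metadata, treats "X-Parsed-By" as a single string
(in Tika it can hold several values), and assumes the two handled types are text/plain
and text/html, matching the TXTParser and HTMLParser named in the description. The class
InlinePartDispatch, the enum Choice and the method choose() are hypothetical names.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical model of the workaround's "which parser do I create?" decision. */
class InlinePartDispatch {

    /** Outcome of the decision; the names are illustrative only. */
    enum Choice { TXT_PARSER, HTML_PARSER, TREAT_AS_ATTACHMENT }

    /** Picks a parser for an inline body part from simplified metadata. */
    static Choice choose(Map<String, String> metadata) {
        // Step 1: guess whether the caller is the RFC822 parser.
        String parsedBy = metadata.getOrDefault("X-Parsed-By", "");
        if (!parsedBy.contains("RFC822Parser")) {
            return Choice.TREAT_AS_ATTACHMENT;
        }
        // Step 2: reduce the Content-Type to its base type,
        // dropping parameters such as "; charset=UTF-8".
        String contentType = metadata.getOrDefault("Content-Type", "");
        int semi = contentType.indexOf(';');
        String base = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        // Step 3: only plain text and HTML get a freshly created parser.
        switch (base) {
            case "text/plain":
                return Choice.TXT_PARSER;
            case "text/html":
                return Choice.HTML_PARSER;
            default:
                return Choice.TREAT_AS_ATTACHMENT;
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("X-Parsed-By", "org.apache.tika.parser.mail.RFC822Parser");
        metadata.put("Content-Type", "text/html; charset=UTF-8");
        System.out.println(choose(metadata));
    }
}
```

A real EmbeddedDocumentExtractor would then hand the part's stream to the chosen parser
together with the passed-in ContentHandler; that wiring is left out here on purpose.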



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
