tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2921) Tika discarding bodies of inline MIME elements in RFC822 email
Date Tue, 13 Aug 2019 13:48:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906221#comment-16906221 ]

Tim Allison commented on TIKA-2921:
-----------------------------------


This is the output I get both from a unit test and when I run {{java -jar tika-app.jar --config=config.xml file.eml}}.

Is this what you're seeing?  How, exactly, are you calling Tika and/or including dependencies?



{noformat}
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Message:Raw-Header:X-Spam-Status" content="No, score=-2.099 tagged_above=-999
required=5&#9;tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1,&#9;DKIM_VALID_AU=-0.1,
DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001,&#9;HTML_MESSAGE=0.001, SPF_PASS=-0.001] autolearn=ham
autolearn_force=no"/>
<meta name="subject" content="Re: website issue?"/>
<meta name="dc:creator" content="Norman Dimock &lt;dimockn@gmail.com&gt;"/>
<meta name="Message:Raw-Header:X-Received" content="by 2002:a1f:8b48:: with SMTP id n69mr3322403vkd.12.1547641463203;
Wed, 16 Jan 2019 04:24:23 -0800 (PST)"/>
<meta name="Message:From-Email" content="dimockn@gmail.com"/>
<meta name="dcterms:created" content="2019-01-16T12:24:10Z"/>
<meta name="Message-To" content="Josh Turner &lt;jturner@handshape.com&gt;"/>
<meta name="Message:Raw-Header:Authentication-Results" content="mail.handshape.com (amavisd-new);&#9;dkim=pass
(2048-bit key) header.d=gmail.com"/>
<meta name="Message:Raw-Header:X-Google-DKIM-Signature" content="v=1; a=rsa-sha256; c=relaxed/relaxed;
       d=1e100.net; s=20161025;        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
        :message-id:subject:to;        bh=ImRnwxGjgAUe17miKW5RkSb+P41jBHp5BWiDMnxmb+8=;  
     b=n0Ql87INTq9Mjgp8dmEhGP8wE9MCZX/a0WQ876dzW++ic5nCMlnhw9j0c09oXIS5hA         VQ6QqeS384BEDtY6oROMn63O8GsQncbpXyamUhg0LMWzOhKhY3iWWawd2h6i+EeYoJEg
        8k+vAFVJU70vtGNLu3GHU477Shw1nFQGhEWccZu68lxkMX9joFEGGUtyJLnH4GqKzYbC         vfhpVgr1pxeOiaU+4Cdth9e+4WLnR9T983q3F5D36NS9tnkcH4LMhhkfEca8raF2MTzX
        g+f8idp3OgiIuGMAOd99Go/nK4vTASix8hCSpnEsbzYKcH5bv0o3dFLN64RJQeIkPUte         G4nA=="/>
<meta name="Message:Raw-Header:X-Gm-Message-State" content="AJcUukcWtkn1r1vSPsnQJF/GJiB2lFaDUgfyVAbbsih6aQt1qbyiN4EW&#9;fJEZFoU2CuQvQn82Lhd0aknLAeFMZ6xkngJtpYU4rA=="/>
<meta name="Message:Raw-Header:X-Virus-Scanned" content="Debian amavisd-new at handshape.com"/>
<meta name="Message:Raw-Header:MIME-Version" content="1.0"/>
<meta name="Multipart-Boundary" content="000000000000a76ce1057f925b48"/>
<meta name="Message:Raw-Header:Message-ID" content="&lt;CAMpLFpCimic+dGB4-zpNRBizbP1uNpFTw=3dvRzAWOpEUi5aPw@mail.gmail.com&gt;"/>
<meta name="dc:title" content="Re: website issue?"/>
<meta name="Message:Raw-Header:X-Spam-Flag" content="NO"/>
<meta name="Message:Raw-Header:In-Reply-To" content="&lt;CAMpLFpCVygEwb+t=FmD6TqiDLrQHkREvh=_2=ZinF8WH1-yxbQ@mail.gmail.com&gt;"/>
<meta name="Content-Length" content="4107"/>
<meta name="Message:Raw-Header:X-Spam-Level" content=""/>
<meta name="Content-Type" content="message/rfc822"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.mail.RFC822Parser"/>
<meta name="creator" content="Norman Dimock &lt;dimockn@gmail.com&gt;"/>
<meta name="Message:Raw-Header:X-Original-To" content="jturner@handshape.com"/>
<meta name="meta:author" content="Norman Dimock &lt;dimockn@gmail.com&gt;"/>
<meta name="Message:Raw-Header:X-Google-Smtp-Source" content="ALg8bN7D3XZNSh8tgBFuosEPt01e12Ue8kk4R9OVClU5OHsa+NcWnqcd1JrII4+rSJNSjwaNu8oppTqZiSi1OMUCNfQ="/>
<meta name="meta:creation-date" content="2019-01-16T12:24:10Z"/>
<meta name="Message:Raw-Header:References" content="&lt;CAMpLFpB=uu_mqGwf5RToWqCfkd9cmZKBoJ782872YDgfp1d2sA@mail.gmail.com&gt;
&lt;CAMpLFpCVygEwb+t=FmD6TqiDLrQHkREvh=_2=ZinF8WH1-yxbQ@mail.gmail.com&gt;"/>
<meta name="Creation-Date" content="2019-01-16T12:24:10Z"/>
<meta name="resourceName" content="TIKA-2921.eml"/>
<meta name="Message:Raw-Header:Return-Path" content="&lt;dimockn@gmail.com&gt;"/>
<meta name="Message:Raw-Header:X-Spam-Score" content="-2.099"/>
<meta name="Message:Raw-Header:DKIM-Signature" content="v=1; a=rsa-sha256; c=relaxed/relaxed;
       d=gmail.com; s=20161025;        h=mime-version:references:in-reply-to:from:date:message-id:subject:to;
       bh=ImRnwxGjgAUe17miKW5RkSb+P41jBHp5BWiDMnxmb+8=;        b=GA7HxxV7NFyCliid7O5w68Pyl+El9pLalsedSV28GjdrjXjAABu12zB+OWjB2lVGBr
        +gNyuAM0zcvHiwVQdlqa6ddq5D+UGT7ppzKDSh8ZTctt89tdmHFMuTECMB93xD8lOFVD         tXoRJjD+bkd9NX18/8whrcweh/WeK7hai+02ZYLrtIxwsrCbfGdm/pY+KgDcHjs3OB/p
        lQJzFJHCgNCZ7oVR+T63RE+YMWfGs1sKIkjB2iIXByZseLR10afCxnBAfkg9Y/Cyjoep         UE6B/4GngonMFO1Qwp55Ym5LcWMNORlIv6hrLwGglz+Rvs84EsFI0EY0hVVpQnB2H5UF
        7/dg=="/>
<meta name="Message:Raw-Header:Delivered-To" content="jturner@handshape.com"/>
<meta name="Message:From-Name" content="Norman Dimock"/>
<meta name="Author" content="Norman Dimock &lt;dimockn@gmail.com&gt;"/>
<meta name="Multipart-Subtype" content="alternative"/>
<meta name="Message:Raw-Header:Received" content="from localhost (localhost [127.0.0.1])&#9;by
handshape.com (Postfix) with ESMTP id 3E3A334690E&#9;for &lt;jturner@handshape.com&gt;;
Wed, 16 Jan 2019 07:24:26 -0500 (EST)"/>
<meta name="Message:Raw-Header:Received" content="from handshape.com ([127.0.0.1])&#9;by
localhost (mail.handshape.com [127.0.0.1]) (amavisd-new, port 10024)&#9;with ESMTP id
1iIzkulfL3MZ for &lt;jturner@handshape.com&gt;;&#9;Wed, 16 Jan 2019 07:24:24 -0500
(EST)"/>
<meta name="Message:Raw-Header:Received" content="from mail-vk1-f175.google.com (mail-vk1-f175.google.com
[209.85.221.175])&#9;(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))&#9;(No
client certificate requested)&#9;by handshape.com (Postfix) with ESMTPS id 5771734690D&#9;for
&lt;jturner@handshape.com&gt;; Wed, 16 Jan 2019 07:24:24 -0500 (EST)"/>
<meta name="Message:Raw-Header:Received" content="by mail-vk1-f175.google.com with SMTP
id 197so1371013vkf.4        for &lt;jturner@handshape.com&gt;; Wed, 16 Jan 2019 04:24:24
-0800 (PST)"/>
<meta name="Message-From" content="Norman Dimock &lt;dimockn@gmail.com&gt;"/>
<title>Re: website issue?</title>
</head>
<body><blockquote>.. twice, I've done that!</blockquote>




</body></html>
{noformat}

> Tika discarding bodies of inline MIME elements in RFC822 email
> --------------------------------------------------------------
>
>                 Key: TIKA-2921
>                 URL: https://issues.apache.org/jira/browse/TIKA-2921
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.22
>         Environment: Reproducible on Java 8 and 11 on both Linux and Win 10.
>            Reporter: Joshua Turner
>            Priority: Major
>         Attachments: tika-2921.xml
>
>
> Given an rfc822 email that has two inline body parts (such as the one attached), MailContentHandler's handleInlineBodyPart() method correctly identifies the body part that should be emitted as the principal content of the mail item, but then uses EmbeddedDocumentUtil.tryToFindExistingLeafParser() to find a parser for that part. If no existing leaf parser is found, it simply gives up and treats the given part as an attachment.
> IMHO, the correct behaviour would be to create the necessary parser if none is found, insert it into the parsing context, and use it to extract the content of the selected body part.
> In the meantime, I'm working around the issue by creating and registering a custom EmbeddedDocumentExtractor to guess whether it's been called by the RFC822Parser by looking at the "X-Parsed-By" metadata value. When triggered, it looks at the Content-Type of the passed-in metadata, and if it's plain text or email, it creates a new TXTParser or HTMLParser and a new context, and has them parse into the passed-in ContentHandler. It works, but it's pretty hacky. It'd be far better to have the change in behaviour suggested above.
> [^test.eml]
> ^I've attached the email inline because using the attachment field yields an error: "JIRA could not attach the file as there was a missing token. Please try attaching the file again." I tried twice with the same error returned.^
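
For readers following the quoted description, the decision step of the reporter's workaround can be modelled roughly as below. This is a minimal sketch, not Tika code and not the reporter's actual implementation: the class InlinePartFallback and the method chooseParser are hypothetical names, a plain Map stands in for Tika's Metadata object, and the returned strings stand in for creating a TXTParser or HTMLParser. It also assumes that the second content type meant in the description is text/html, since HTMLParser is the parser it names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy model of the workaround's decision step; no Tika classes are used. */
class InlinePartFallback {

    // Value that the RFC822 parser reports under "X-Parsed-By" in the output above.
    static final String RFC822 = "org.apache.tika.parser.mail.RFC822Parser";

    /**
     * Returns the simple name of the parser the workaround would create,
     * or null to leave the part to the default attachment handling.
     * The Map is a stand-in for Tika's Metadata (assumption).
     */
    static String chooseParser(Map<String, List<String>> metadata) {
        List<String> parsedBy =
                metadata.getOrDefault("X-Parsed-By", Collections.<String>emptyList());
        if (!parsedBy.contains(RFC822)) {
            return null; // not called on behalf of the mail parser
        }
        List<String> types =
                metadata.getOrDefault("Content-Type", Collections.<String>emptyList());
        if (types.isEmpty()) {
            return null; // nothing to decide on
        }
        // Drop parameters such as "; charset=UTF-8" and normalise case.
        String mediaType = types.get(0).split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        switch (mediaType) {
            case "text/plain":
                return "TXTParser";
            case "text/html":
                return "HTMLParser"; // assumed; the issue text says "email"
            default:
                return null;
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> part = new HashMap<>();
        part.put("X-Parsed-By",
                Arrays.asList("org.apache.tika.parser.DefaultParser", RFC822));
        part.put("Content-Type",
                Collections.singletonList("text/plain; charset=UTF-8"));
        System.out.println(chooseParser(part));
    }
}
```

In a real extractor the two non-null branches would build the parser and a fresh context and parse into the passed-in ContentHandler, as the description says; the null branch would fall through to whatever the default extractor does.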



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
