tika-dev mailing list archives

From "Konstantin Gribov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2685) Email attached to an undeliverable email report are not extracted
Date Mon, 03 Sep 2018 16:59:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602346#comment-16602346 ]

Konstantin Gribov commented on TIKA-2685:
-----------------------------------------

[~tallison@apache.org], and {{multipart/report}} may contain {{text/plain}} or {{text/html}} as its first part.
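
To illustrate the layout being discussed, here is a minimal sketch using only Python's standard {{email}} package, not Tika. RFC 6522 describes a {{multipart/report}} as a human-readable first part, a machine-readable status part, and an optional copy of the returned message. The bounce message below is hand-written for the example (the addresses, the boundary, and the {{describe_report}} helper are all hypothetical); it is not the {{undeliverable.eml}} attached to this issue.

```python
from email import message_from_string, policy

# Hypothetical, hand-written bounce; NOT the attachment from TIKA-2685.
RAW = """\
From: postmaster@example.com
To: sender@example.com
Subject: Undeliverable: test
MIME-Version: 1.0
Content-Type: multipart/report; report-type=delivery-status; boundary="BOUND"

--BOUND
Content-Type: text/plain; charset=us-ascii

Delivery has failed to these recipients.

--BOUND
Content-Type: message/delivery-status

Reporting-MTA: dns; mail.example.com

Final-Recipient: rfc822; nobody@example.com
Action: failed
Status: 5.1.1

--BOUND
Content-Type: message/rfc822

From: sender@example.com
To: nobody@example.com
Subject: test

Original body.

--BOUND--
"""


def describe_report(raw):
    """Return (content types of the report's direct parts,
    subjects of any attached message/rfc822 parts)."""
    msg = message_from_string(raw, policy=policy.default)
    types, subjects = [], []
    for part in msg.walk():
        # walk() also visits nested parts, so a report wrapped in
        # multipart/mixed would be found as well.
        if part.get_content_type() == "multipart/report":
            for child in part.iter_parts():
                types.append(child.get_content_type())
                if child.get_content_type() == "message/rfc822":
                    # The payload of message/rfc822 is a one-element
                    # list holding the attached message object.
                    subjects.append(str(child.get_payload(0)["Subject"]))
    return types, subjects


part_types, attached_subjects = describe_report(RAW)
print(part_types)
print(attached_subjects)
```

The point of the sketch is only that the first part can be any human-readable type while the original message sits in a later {{message/rfc822}} part; how Tika's {{RFC822Parser}} should handle that is a separate question.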

> Email attached to an undeliverable email report are not extracted
> -----------------------------------------------------------------
>
>                 Key: TIKA-2685
>                 URL: https://issues.apache.org/jira/browse/TIKA-2685
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.18
>            Reporter: Yury Kats
>            Assignee: Tim Allison
>            Priority: Major
>         Attachments: undeliverable.eml
>
>
> I have a number of email messages that are reports of undeliverable emails that contain the original email message as an attachment.
> The original emails are parts with "Content-Type: message/rfc822" but are not being recognized as such.
> Attached is an example email:
>  * Subject: Undeliverable: SRE Agent Out of Space Source:WindowsApp
>  ** Subject: Subject: SRE Agent Out of Space Source:WindowsApp
>  
> I would like to see 2 separate emails parsed out (top-level undeliverable report, 1st-level attached original email), but I get 1 email and 2 unnamed text attachments:
> {noformat}
> $ java -jar tika-app-1.18.jar -m -J  /tmp/undeliverable.eml | python -m json.tool
> [
>     {
>         "Author": "postmaster@bank.com",
>         "Content-Length": "17356",
>         "Content-Type": "message/rfc822",
>         "Creation-Date": "2017-11-04T16:00:11Z",
>         "Message-From": "postmaster@bank.com",
>         "Message-To": "UATAlerting@logscape.com",
>         "Message:From-Email": "postmaster@bank.com",
>         "Message:Raw-Header:Auto-Submitted": "auto-generated",
>         "Message:Raw-Header:MIME-Version": "1.0",
>         "Message:Raw-Header:Message-ID": "<936a3c2c-49e5-46a0-b58c-151c024b80fe@journal.report.generator>",
>         "Message:Raw-Header:Return-Path": "<>",
>         "Message:Raw-Header:Sender": "<MicrosoftExchange329e71ec88ae4615bbc36ab6ce41109e@bank.com>",
>         "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
>         "Message:Raw-Header:X-MS-Exchange-Message-Is-Ndr": "",
>         "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": "\t<1451b918-770a-4d83-b1f9-0c9c0668f1d6@BXTS124020.eu.banknet.com>",
>         "Message:Raw-Header:X-MS-Journal-Report": "",
>         "Multipart-Boundary": "_5a8d7320-7cd6-4c1b-8e30-9616634562b2_",
>         "Multipart-Subtype": "mixed",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.mail.RFC822Parser"
>         ],
>         "X-TIKA:parse_time_millis": "326",
>         "creator": "postmaster@bank.com",
>         "dc:creator": "postmaster@bank.com",
>         "dc:title": "Undeliverable: SRE Agent Out of Space Source:WindowsApp",
>         "dcterms:created": "2017-11-04T16:00:11Z",
>         "meta:author": "postmaster@bank.com",
>         "meta:creation-date": "2017-11-04T16:00:11Z",
>         "resourceName": "undeliverable.eml",
>         "subject": "Undeliverable: SRE Agent Out of Space Source:WindowsApp"
>     },
>     {
>         "Content-Encoding": "windows-1252",
>         "Content-Type": "text/plain; charset=windows-1252",
>         "Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_",
>         "Multipart-Subtype": "report",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.txt.TXTParser"
>         ],
>         "X-TIKA:embedded_resource_path": "/embedded-1",
>         "X-TIKA:parse_time_millis": "4",
>         "embeddedResourceType": "ATTACHMENT"
>     },
>     {
>         "Content-Encoding": "US-ASCII",
>         "Content-Type": "text/html; charset=US-ASCII",
>         "Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_",
>         "Multipart-Subtype": "report",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.html.HtmlParser"
>         ],
>         "X-TIKA:embedded_resource_path": "/embedded-2",
>         "X-TIKA:parse_time_millis": "7",
>         "embeddedResourceType": "ATTACHMENT"
>     }
> ]
> {noformat}
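
A quick way to check output like the above for the expected attached message is to scan the {{-J}} JSON for embedded entries typed {{message/rfc822}}. The following is a rough sketch, assuming the JSON is a list of metadata objects as quoted above; the sample is a trimmed copy of that output with only the keys the check uses, and {{embedded_messages}} is a hypothetical helper name.

```python
import json

# Trimmed copy of the -J output quoted in the issue; only the keys used
# by the check below are kept.
TIKA_JSON = """
[
  {"Content-Type": "message/rfc822", "resourceName": "undeliverable.eml"},
  {"Content-Type": "text/plain; charset=windows-1252",
   "Multipart-Subtype": "report",
   "X-TIKA:embedded_resource_path": "/embedded-1"},
  {"Content-Type": "text/html; charset=US-ASCII",
   "Multipart-Subtype": "report",
   "X-TIKA:embedded_resource_path": "/embedded-2"}
]
"""


def embedded_messages(entries):
    """Return the embedded entries whose media type is message/rfc822.

    An entry counts as embedded when it carries an
    X-TIKA:embedded_resource_path key; parameters such as charset are
    stripped from Content-Type before comparing.
    """
    return [
        e for e in entries
        if "X-TIKA:embedded_resource_path" in e
        and e.get("Content-Type", "").split(";")[0].strip() == "message/rfc822"
    ]


entries = json.loads(TIKA_JSON)
found = embedded_messages(entries)
print(len(entries), "entries,", len(found), "embedded message/rfc822")
```

For the output quoted in the issue this finds no embedded {{message/rfc822}} entry, which matches the reported behaviour; a fixed parser would be expected to yield at least one.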



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
