tika-dev mailing list archives

From "Konstantin Gribov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted
Date Mon, 03 Sep 2018 17:12:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602361#comment-16602361 ]

Konstantin Gribov commented on TIKA-2680:
-----------------------------------------

Just my 2c: I stopped using Tika for RFC822 parsing sometime in 2012-2013. Since then I use
mime4j directly for RFC822 and delegate attachment parsing to Tika. In my case, though, I know
beforehand what I'll parse (normal files, plain emls, emls with external metadata from a DLP
system, or MSE journaled emls), so I can parse each with a specific parser. Of course, I have
to track whether I'm parsing an attachment (set or reset a flag in the field handler depending
on whether {{Content-Disposition}} is found with or without it, and reset the flag in
{{startBodyPart}}) as well as the current depth in the multipart tree.
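
To illustrate the bookkeeping described above (an attachment flag plus the current depth in the multipart tree), here is a minimal sketch. It is not the mime4j handler itself: it swaps in Python's standard {{email}} package and a recursive tree walk instead of mime4j's event callbacks, and the names {{walk_parts}} and {{SAMPLE}} are made up for the example.

```python
from email import message_from_string

# Hypothetical two-level message: an outer mail with one attached email.
SAMPLE = "\n".join([
    "Subject: outer",
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="B1"',
    "",
    "--B1",
    "Content-Type: text/plain",
    "",
    "outer body",
    "--B1",
    "Content-Type: message/rfc822",
    "Content-Disposition: attachment",
    "",
    "Subject: inner",
    "Content-Type: text/plain",
    "",
    "inner body",
    "--B1--",
    "",
])


def walk_parts(msg, depth=0, out=None):
    """Collect (depth, content_type, is_attachment, subject) for every part."""
    if out is None:
        out = []
    # get_content_disposition() returns 'attachment', 'inline' or None.
    is_attachment = msg.get_content_disposition() == "attachment"
    out.append((depth, msg.get_content_type(), is_attachment, msg.get("Subject")))
    if msg.is_multipart():
        # For multipart/* the payload is the list of parts; for message/rfc822
        # it is a one-element list holding the nested message.
        for part in msg.get_payload():
            walk_parts(part, depth + 1, out)
    return out


if __name__ == "__main__":
    for row in walk_parts(message_from_string(SAMPLE)):
        print(row)
```

With an event-driven parser the same state would live in handler fields that are updated in the callbacks rather than passed down a recursion, but the two pieces of information being tracked are the same.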

> Email attachments to an email are not extracted
> -----------------------------------------------
>
>                 Key: TIKA-2680
>                 URL: https://issues.apache.org/jira/browse/TIKA-2680
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.18
>            Reporter: Yury Kats
>            Assignee: Tim Allison
>            Priority: Major
>         Attachments: main_email_in_outlook.jpg, nested.eml
>
>
> I have a number of email messages that contain other email messages as attachments (with multiple levels of nesting).
> The email attachments are parts with "Content-Type: message/rfc822" but are not being recognized as such.
> Attached is an example email, with the multiple levels of attachments:
>  * Subject: Test email within email
>  ** Subject: Email within email test
>  *** Subject: Stand-up today
>  
> I would like to see 3 separate emails parsed out (top level, 1st level attached email, 2nd level attached email), but I only get 1 email and 1 unnamed text attachment:
> {noformat}
> $ java -jar tika-app-1.18.jar -m -J nested.eml | python -m json.tool
> [
> {
> "Author": "Smith Van der, H (Henry) <Henry.Van.der.Smith@bank.com>",
> "Content-Length": "16649",
> "Content-Type": "message/rfc822",
> "Creation-Date": "2018-04-25T12:46:41Z",
> "Message-From": "Smith Van der, H (Henry) <Henry.Van.der.Smith@bank.com>",
> "Message-To": [
> "fm.SAN Management Team <fm.SANManagementTeam@bank.com>",
> "Smith Van der, H (Henry) <Henry.Van.der.Smith@bank.com>"
> ],
> "Message:From-Email": "Henry.Van.der.Smith@bank.com",
> "Message:From-Name": "Smith Van der, H (Henry)",
> "Message:Raw-Header:Auto-Submitted": "auto-generated",
> "Message:Raw-Header:Content-Transfer-Encoding": "binary",
> "Message:Raw-Header:Keywords": "",
> "Message:Raw-Header:MIME-Version": "1.0",
> "Message:Raw-Header:Message-ID": "<ab2078ea-fd2f-4b28-bc8d-451916369b3c@journal.report.generator>",
> "Message:Raw-Header:Return-Path": "<>",
> "Message:Raw-Header:Sender": "<MicrosoftExchange329e71ec88ae4615bbc36ab6ce41109e@bank.com>",
> "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
> "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": "<0fab98cd190c41f199a25c73f78a2070@BSTS124002.eu.banknet.com>",
> "Message:Raw-Header:X-MS-Journal-Report": "",
> "Multipart-Boundary": "_728aa617-16cf-4d95-8bc2-9f1868397202_",
> "Multipart-Subtype": "mixed",
> "X-Parsed-By": [
> "org.apache.tika.parser.DefaultParser",
> "org.apache.tika.parser.mail.RFC822Parser"
> ],
> "X-TIKA:parse_time_millis": "325",
> "creator": "Smith Van der, H (Henry) <Henry.Van.der.Smith@bank.com>",
> "dc:creator": "Smith Van der, H (Henry) <Henry.Van.der.Smith@bank.com>",
> "dc:title": "Test email within email",
> "dcterms:created": "2018-04-25T12:46:41Z",
> "meta:author": "Smith Van der, H (Henry) <Henry.Van.der.Smith@bank.com>",
> "meta:creation-date": "2018-04-25T12:46:41Z",
> "resourceName": "nested.eml",
> "subject": "Test email within email"
> },
> {
> "Content-Encoding": "US-ASCII",
> "Content-Type": "text/plain; charset=US-ASCII",
> "Multipart-Boundary": "_004_8075737674787666767166806676697476787366657271727266777_",
> "Multipart-Subtype": "mixed",
> "X-Parsed-By": [
> "org.apache.tika.parser.DefaultParser",
> "org.apache.tika.parser.txt.TXTParser"
> ],
> "X-TIKA:embedded_resource_path": "/embedded-1",
> "X-TIKA:parse_time_millis": "5",
> "embeddedResourceType": "ATTACHMENT"
> }
> ]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
