tika-dev mailing list archives

From "Eli Trucco (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2037) Problems with email attachments
Date Wed, 20 Jul 2016 15:58:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386079#comment-15386079 ]

Eli Trucco commented on TIKA-2037:

Thanks, Tim! Another thing I noticed: in the example code ExtractEmbeddedFiles.java (link
above), the input stream should first be wrapped in a TikaInputStream. Otherwise the RFC822
parser throws an exception, because the unwrapped stream doesn't support mark/reset.
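
A minimal sketch of the mark/reset problem described above, not taken from the Tika sources. In Tika itself the wrapping call would be TikaInputStream.get(stream). So that the sketch runs without Tika on the classpath, a plain BufferedInputStream stands in for it here, which models only the mark/reset aspect. The class name MarkResetDemo, the helpers peek and ensureMarkSupported, and the sample message are all hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;

public class MarkResetDemo {

    // Reads up to n bytes from the start of the stream, then rewinds it,
    // the way a detector or parser might sniff a header before parsing.
    public static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("mark/reset not supported by " + in.getClass().getName());
        }
        in.mark(n);
        byte[] buf = new byte[n];
        int total = 0;
        while (total < n) {
            int read = in.read(buf, total, n - total);
            if (read < 0) {
                break; // end of stream
            }
            total += read;
        }
        in.reset();
        return Arrays.copyOf(buf, total);
    }

    // Stand-in for TikaInputStream.get(stream) (assumption: only the
    // mark/reset aspect is modelled). Adds buffering when it is missing.
    public static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample message written to a temporary .eml file.
        File eml = File.createTempFile("sample", ".eml");
        eml.deleteOnExit();
        Files.write(eml.toPath(),
                "From: a@example.org\r\nSubject: test\r\n\r\nbody\r\n"
                        .getBytes(StandardCharsets.US_ASCII));

        try (InputStream raw = new FileInputStream(eml)) {
            // A FileInputStream does not support mark/reset on its own.
            System.out.println("raw markSupported: " + raw.markSupported());
            try (InputStream wrapped = ensureMarkSupported(raw)) {
                byte[] head = peek(wrapped, 5);
                System.out.println("peeked: " + new String(head, StandardCharsets.US_ASCII));
                // The stream was rewound, so reading continues from byte 0.
                System.out.println("first byte after reset: " + (char) wrapped.read());
            }
        }
    }
}
```

Calling peek on the raw FileInputStream directly would fail with the IOException above, which mirrors the failure reported for the unwrapped stream in the example code.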

> Problems with email attachments
> -------------------------------
>                 Key: TIKA-2037
>                 URL: https://issues.apache.org/jira/browse/TIKA-2037
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 1.13
>         Environment: Eclipse, Java 8
>            Reporter: Eli Trucco
>            Priority: Minor
>         Attachments: CameraCalibration.eml, Exkursion.eml
> I stumbled across a couple of problems while parsing and extracting attachments from
> .eml files from Thunderbird. Some of them are wrongly identified (as text/html or application/xhtml+xml),
> and in a lot of them the attachments are not detected. I tried to parse 20 random .eml files
> with attachments (pdf, txt, html, etc.), and at least 10 of them are either identified as HTML,
> or correctly identified as rfc822 but with the attachments not extracted. I tried the same
> files using TikaCLI with the -z option, with the same result.
> What I did: I extended the class ParsingEmbeddedDocumentExtractor to extract and store
> the attachments somewhere else (exactly as shown in this example code: https://github.com/apache/tika/blob/master/tika-example/src/main/java/org/apache/tika/example/ExtractEmbeddedFiles.java).

This message was sent by Atlassian JIRA