tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2723) Issue with parsing .mht container
Date Thu, 06 Sep 2018 08:04:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605430#comment-16605430 ]

Nick Burch commented on TIKA-2723:

Looking at your files, the correct mime type really is some sort of email / mime-encoded message
format. They really aren't HTML or XLS files, and can't be opened or processed as HTML or
XLS files (except by programs that transparently decode them on the fly).
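To illustrate the point, here is a minimal sketch that reads only the top-level header block of a file and reports its declared Content-Type. It is not Tika code and not how Tika's detection works; the class name {{MhtHeaderSniffer}} and its methods are hypothetical, and it assumes the file starts with RFC 822 style headers, as a .mht saved from IE normally does.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Tika.
final class MhtHeaderSniffer {

    /**
     * Returns the lower-cased media type from the top-level Content-Type
     * header, or null if the header block ends without one.
     */
    static String topLevelMediaType(String text) {
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\r?\n", -1)) {
            // Folded headers continue on lines that start with space or tab.
            if (line.startsWith(" ") || line.startsWith("\t")) {
                current.append(' ').append(line.trim());
                continue;
            }
            String found = mediaTypeOf(current.toString());
            if (found != null) {
                return found;
            }
            if (line.isEmpty()) {
                return null; // blank line: end of the header block
            }
            current.setLength(0);
            current.append(line);
        }
        return mediaTypeOf(current.toString());
    }

    // Extracts "type/subtype" if the given unfolded header is Content-Type.
    private static String mediaTypeOf(String header) {
        int colon = header.indexOf(':');
        if (colon < 0) {
            return null;
        }
        if (!header.substring(0, colon).trim().equalsIgnoreCase("Content-Type")) {
            return null;
        }
        String value = header.substring(colon + 1);
        int semi = value.indexOf(';');
        if (semi >= 0) {
            value = value.substring(0, semi); // drop parameters such as boundary=
        }
        return value.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) throws IOException {
        // ISO-8859-1 maps every byte to a char, which is enough for headers.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(topLevelMediaType(new String(bytes, StandardCharsets.ISO_8859_1)));
    }
}
```

Run against a .mht file path, this should print whatever the file itself declares in its header, which for such files is typically a multipart type rather than {{text/html}}.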

At the Tika detection level, we can probably put some more specific magic in, which would
detect them as {{application/x-mimearchive}} or {{multipart/related}} as a subtype of
{{message/rfc822}}. At the parsing level, I think we'll have to treat them as an email (which
is how they're stored!), and you can then get all the different parts as attachments of the
email (again, which is how they're actually stored).
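As a rough sketch of what "parts as attachments" means on disk, the following walks the multipart body and lists each part's Content-Type and Content-Location. It is a simplified illustration, not Tika's RFC822 parser: the class {{MhtPartLister}} is hypothetical, and it assumes a single non-nested multipart with an unquoted-space-free boundary and unfolded part headers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Tika.
final class MhtPartLister {

    // Simplification: takes the first boundary= parameter in the top headers.
    private static final Pattern BOUNDARY =
        Pattern.compile("boundary\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Returns one "media-type location" string per body part, in file order. */
    static List<String> listParts(String message) {
        String[] lines = message.split("\r?\n", -1);
        List<String> parts = new ArrayList<>();

        // Collect the top-level header block (everything before the first blank line).
        int i = 0;
        StringBuilder topHeaders = new StringBuilder();
        while (i < lines.length && !lines[i].isEmpty()) {
            topHeaders.append(lines[i]).append('\n');
            i++;
        }
        Matcher m = BOUNDARY.matcher(topHeaders);
        if (!m.find()) {
            return parts; // not a multipart message as far as this sketch can tell
        }
        String delimiter = "--" + m.group(1);

        String type = null;
        String location = null;
        boolean inPart = false;
        boolean inHeaders = false;
        for (; i < lines.length; i++) {
            String line = lines[i];
            String trimmed = line.trim();
            boolean isClose = trimmed.equals(delimiter + "--");
            if (isClose || trimmed.equals(delimiter)) {
                if (inPart) {
                    parts.add(describe(type, location));
                }
                inPart = !isClose;
                if (isClose) {
                    break; // closing delimiter: no more parts
                }
                inHeaders = true;
                type = null;
                location = null;
            } else if (inPart && inHeaders) {
                if (line.isEmpty()) {
                    inHeaders = false; // part body starts; its content is skipped
                } else {
                    String v = headerValue(line, "Content-Type");
                    if (v != null) {
                        int semi = v.indexOf(';');
                        type = (semi >= 0 ? v.substring(0, semi) : v).trim().toLowerCase(Locale.ROOT);
                    }
                    v = headerValue(line, "Content-Location");
                    if (v != null) {
                        location = v;
                    }
                }
            }
        }
        if (inPart) {
            parts.add(describe(type, location)); // file ended without a closing delimiter
        }
        return parts;
    }

    // Returns the trimmed value if the line is the named header, else null.
    private static String headerValue(String line, String name) {
        String prefix = name + ":";
        if (line.regionMatches(true, 0, prefix, 0, prefix.length())) {
            return line.substring(prefix.length()).trim();
        }
        return null;
    }

    private static String describe(String type, String location) {
        String t = (type == null) ? "(no Content-Type)" : type;
        return (location == null) ? t : t + " " + location;
    }
}
```

Calling {{MhtPartLister.listParts(text)}} on the contents of a saved web page would be expected to list the HTML part first, followed by its images and stylesheets, each as a separate MIME part.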

If we had Tika report the mime type of a MHT file saved from IE as {{text/html}}, that'd be
wrong, fail to parse, and mean you couldn't get the other embedded resources (images/css/etc),
so not what you'd want!

[~tallison@apache.org] Do you think going via the RFC822 parser is best? Or is a special
(related?) mime-based parser for {{application/x-mimearchive}} or {{multipart/related}}
the way forward?

> Issue with parsing .mht container
> ---------------------------------
>                 Key: TIKA-2723
>                 URL: https://issues.apache.org/jira/browse/TIKA-2723
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.17
>            Reporter: Ghenadie
>            Priority: Major
>              Labels: patch
>             Fix For: 1.17
>         Attachments: Sample-excel.mht, [TIKA-2723] Issue with parsing _mht container - ASF JIRA.mht
> Hello,
> I have a file with .mht extension. Tika processes this file as an email (Is Email? - true), and uses RFC822Parser to parse it. As a result, I have the content with email fields, as: From, To, CC, BCC, Subject.
> This is an issue for me. And seems to be an issue from Tika. As far as this is a web container, it should not be parsed through RFCParser (which is an email parser).
> Please investigate this issue as soon as possible. 
> Please let me know in case of any questions.
> Thank you,
> Ghenadie R.

This message was sent by Atlassian JIRA
