tika-dev mailing list archives

From "Vjeran Marcinko (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1788) message/rfc822 parser doesn't identify attachment filenames from Content-Disposition header
Date Thu, 12 Nov 2015 07:04:11 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001773#comment-15001773 ]

Vjeran Marcinko commented on TIKA-1788:

I don't know the James library at all, so I can't say whether this would negatively affect some
other part of the parser, but...

The thing is that Tika's current RFC822Parser indirectly ends up with James' BasicBodyDescriptor
instead of MaximalBodyDescriptor, because of the way RFC822Parser instantiates James' MimeStreamParser
internally. If the parser were instead instantiated with a DefaultBodyDescriptorBuilder:
MimeStreamParser parser = new MimeStreamParser(config, null, new DefaultBodyDescriptorBuilder());
then James would create a MaximalBodyDescriptor during parsing, which recognizes the
Content-Disposition field. That could then be used in Tika's MailContentHandler, say in the
body(...) method, if we add:
    public void body(BodyDescriptor body, InputStream is) throws MimeException,
            IOException {
        // use a different metadata object
        // in order to specify the mime type of the
        // sub part without damaging the main metadata

        Metadata submd = new Metadata();
        submd.set(Metadata.CONTENT_TYPE, body.getMimeType());
        submd.set(Metadata.CONTENT_ENCODING, body.getCharset());
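        // MaximalBodyDescriptor is only produced when the parser was built
        // with DefaultBodyDescriptorBuilder, hence the instanceof check
        if (body instanceof MaximalBodyDescriptor) {
            MaximalBodyDescriptor maximalBodyDescriptor = (MaximalBodyDescriptor) body;
            String contentDispositionFilename =
                    maximalBodyDescriptor.getContentDispositionFilename();
            if (contentDispositionFilename != null) {
                submd.set(Metadata.RESOURCE_NAME_KEY, contentDispositionFilename);
            }
        }
        // ... rest of the existing body(...) method unchanged
    }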

> message/rfc822 parser doesn't identify attachment filenames from Content-Disposition
> -------------------------------------------------------------------------------------------
>                 Key: TIKA-1788
>                 URL: https://issues.apache.org/jira/browse/TIKA-1788
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.11
>            Reporter: Sergey Tsalkov
>         Attachments: grep_content_disposition.zip
> rfc822 email files can contain attachments as subparts, and they'll
> generally specify the filename of the attachment in a manner like
> this:
> Content-Disposition: attachment;
>         filename*=utf-8''image001.jpg
> Tika doesn't seem to be grabbing that information at all!
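
For anyone reading along: the filename*=utf-8''image001.jpg form quoted above is the RFC 2231
extended parameter syntax, i.e. a charset, an optional language tag, and a percent-encoded value,
separated by single quotes. Below is a minimal sketch in plain Java (no mime4j, no Tika) of how
such a value could be decoded. ExtendedFilename and decode(...) are hypothetical names made up
for this illustration, not anything from James or Tika, and the sketch assumes a single filename*
parameter, so RFC 2231 continuations (filename*0*=, filename*1*=) are not handled.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Tika or mime4j. */
final class ExtendedFilename {

    // Matches: filename*=<charset>'<language>'<percent-encoded value>
    private static final Pattern EXTENDED = Pattern.compile(
            "filename\\*\\s*=\\s*([^';\\s]*)'([^';\\s]*)'([^;\\s]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the decoded filename, or null if there is no filename* parameter. */
    static String decode(String contentDisposition) {
        if (contentDisposition == null) {
            return null;
        }
        Matcher m = EXTENDED.matcher(contentDisposition);
        if (!m.find()) {
            return null;
        }
        // Assumption: fall back to US-ASCII when the charset part is empty
        String charset = m.group(1).isEmpty() ? "US-ASCII" : m.group(1);
        // URLDecoder turns '+' into a space, so protect literal '+' first
        String value = m.group(3).replace("+", "%2B");
        try {
            return URLDecoder.decode(value, charset);
        } catch (UnsupportedEncodingException | IllegalArgumentException e) {
            // Unknown charset or malformed escape: return the raw value
            return m.group(3);
        }
    }

    public static void main(String[] args) {
        String header = "attachment;\r\n        filename*=utf-8''image001.jpg";
        System.out.println(decode(header)); // prints "image001.jpg"
    }
}
```

This is only meant to show what the header value contains. Letting mime4j parse the header via
MaximalBodyDescriptor, as suggested in the comment, is probably the more robust route.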
