tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2471) Tab-prefixed message body lines in Mbox interpreted as headers
Date Fri, 20 Oct 2017 17:33:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212929#comment-16212929 ]

Tim Allison commented on TIKA-2471:

It looks like [~kkrugler]'s original mbox contribution on TIKA-295 predates mime4j's MboxIterator
by three years.

Are there any reasons not to trust mime4j's MboxIterator?

Looks like it hasn't had much activity in the last few years.

Should we try to integrate it?

> Tab-prefixed message body lines in Mbox interpreted as headers
> --------------------------------------------------------------
>                 Key: TIKA-2471
>                 URL: https://issues.apache.org/jira/browse/TIKA-2471
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.16
>            Reporter: Matthew Caruana Galizia
>              Labels: message, rfc822
>         Attachments: mbox
> The mbox parser code is overly optimistic. It parses the entire message looking for anything that matches a header pattern, wherever it occurs in a line!
> It looks to me like the parsing logic is in desperate need of a refactor. But more to the point, what is the idea behind setting the headers in the MboxParser if they're going to be set by the RFC822Parser in any case?
> Also, out of curiosity, why does the parser force Windows-1252 as the charset?
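
To illustrate the behaviour the report asks for, here is a minimal sketch of header collection that stops at the first blank line, so a tab-prefixed body line can never be taken for a header. It is not Tika's MboxParser or mime4j code: the class name HeaderSketch and the method parseHeaders are hypothetical, and the sketch assumes one message already split into lines, with headers separated from the body by an empty line and folded header lines starting with a space or tab. A repeated header name simply overwrites the earlier value here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HeaderSketch {

    /**
     * Collects headers from the lines of a single message, stopping at the
     * first empty line. Lines after that point are body text and are never
     * inspected. (Hypothetical helper, not part of Tika or mime4j.)
     */
    public static Map<String, String> parseHeaders(String[] lines) {
        Map<String, String> headers = new LinkedHashMap<>();
        String lastName = null;
        for (String line : lines) {
            if (line.isEmpty()) {
                break; // a blank line ends the header section
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                // Folded continuation: append to the previous header, if any.
                if (lastName != null) {
                    headers.put(lastName, headers.get(lastName) + " " + line.trim());
                }
                continue;
            }
            int colon = line.indexOf(':');
            if (colon > 0) {
                lastName = line.substring(0, colon).trim();
                headers.put(lastName, line.substring(colon + 1).trim());
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        String[] message = {
            "Subject: weekly",
            "\treport",
            "To: dev@example.org",
            "",
            "\tFrom: this tab-prefixed line is body text"
        };
        // Only Subject and To are collected; the body line is ignored.
        System.out.println(parseHeaders(message));
    }
}
```

Inside the header section a leading tab still means "continuation of the previous header", which is why the sketch tracks the last header name rather than matching a header pattern anywhere in a line.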

This message was sent by Atlassian JIRA
