tika-dev mailing list archives

From "Mark Butler (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-667) Changes to RFC822Parser to support turning off strict parsing
Date Fri, 27 May 2011 10:37:47 GMT
Changes to RFC822Parser to support turning off strict parsing

                 Key: TIKA-667
                 URL: https://issues.apache.org/jira/browse/TIKA-667
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.0
            Reporter: Mark Butler
            Priority: Minor
             Fix For: 1.0
         Attachments: mailparser.diff

Currently, if Apache-Mime4J fails while parsing any field, RFC822Parser fails to parse the
whole document. This causes problems on the Enron Corpus; see https://issues.apache.org/jira/browse/TIKA-657

RFC822Parser is configured from a MimeEntityConfig object, which contains an option for
"strict parsing". Currently MailContentHandler only performs strict parsing, i.e. if a
MimeException is encountered while processing any field in MailContentHandler.field(), then
processing the document fails. However, we may prefer non-strict parsing, i.e. to continue
even if processing one or more fields fails. This can be achieved by placing a try / catch
block around the logic inside MailContentHandler.field() and rethrowing the error only if
strictParsing is enabled; otherwise we log the error.
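
To make the proposal concrete, here is a minimal, self-contained sketch of the try / catch
pattern. It is not the code from the attached diff: FieldParseException, LenientFieldHandler
and parseField are hypothetical stand-ins (for MimeException, MailContentHandler and the real
field logic) so that the example compiles without Mime4J on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical stand-in for Mime4J's MimeException, so the sketch compiles on its own.
class FieldParseException extends Exception {
    FieldParseException(String message) {
        super(message);
    }
}

// Hypothetical, simplified handler; an assumption, not the real MailContentHandler.
class LenientFieldHandler {
    private static final Logger LOG =
            Logger.getLogger(LenientFieldHandler.class.getName());

    private final boolean strictParsing;
    private final List<String> parsed = new ArrayList<>();

    LenientFieldHandler(boolean strictParsing) {
        this.strictParsing = strictParsing;
    }

    // Called once per header field, mirroring the role of field() in the description.
    void field(String rawField) throws FieldParseException {
        try {
            parsed.add(parseField(rawField));
        } catch (FieldParseException e) {
            if (strictParsing) {
                throw e; // strict mode: one bad field fails the whole document
            }
            // lenient mode: record the problem and carry on with the next field
            LOG.warning("Skipping unparseable field: " + e.getMessage());
        }
    }

    // Toy parser: accepts "Name: value" and rejects anything without a field name.
    private static String parseField(String rawField) throws FieldParseException {
        int colon = rawField.indexOf(':');
        if (colon <= 0) {
            throw new FieldParseException("no field name in: " + rawField);
        }
        return rawField.substring(0, colon).trim() + "="
                + rawField.substring(colon + 1).trim();
    }

    List<String> parsedFields() {
        return parsed;
    }
}
```

With strictParsing set to false, a malformed field is logged and skipped while the remaining
fields are still collected; with strictParsing set to true, the exception propagates as before.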

I enclose a diff for RFC822Parser and MailContentHandler that does this. I have also made
some other minor changes to MailContentHandler: there was some repeated code for handling
the To:, Cc: and Bcc: fields, so I have replaced that with a single private method, and I have
rewritten stripOutFieldPrefix to avoid manipulating the String by re-assignment.
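
As a rough illustration of the kind of cleanup described (the attached diff is not
reproduced here), the sketch below routes To:, Cc: and Bcc: through one private method and
strips the field prefix in a single expression. The class AddressFieldSketch, the method
addAddresses, the two-argument signature of stripOutFieldPrefix and the naive comma split
are all assumptions made for the example, not the actual Tika code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the code from mailparser.diff.
class AddressFieldSketch {
    private static final String[] ADDRESS_FIELDS = {"To", "Cc", "Bcc"};

    private final Map<String, List<String>> metadata = new LinkedHashMap<>();

    // Dispatches any of the three address fields to the same helper.
    void field(String rawField) {
        for (String name : ADDRESS_FIELDS) {
            String prefix = name + ":";
            if (rawField.regionMatches(true, 0, prefix, 0, prefix.length())) {
                addAddresses(name, rawField, prefix);
                return;
            }
        }
    }

    // Single private method replacing three near-identical code paths.
    private void addAddresses(String key, String rawField, String prefix) {
        // Naive split on commas; real address lists need a proper parser.
        for (String address : stripOutFieldPrefix(rawField, prefix).split(",")) {
            String trimmed = address.trim();
            if (!trimmed.isEmpty()) {
                metadata.computeIfAbsent(key, k -> new ArrayList<>()).add(trimmed);
            }
        }
    }

    // Returns the text after the prefix without reassigning a local String.
    static String stripOutFieldPrefix(String rawField, String prefix) {
        return rawField.regionMatches(true, 0, prefix, 0, prefix.length())
                ? rawField.substring(prefix.length()).trim()
                : rawField.trim();
    }

    List<String> get(String key) {
        return metadata.getOrDefault(key, Collections.<String>emptyList());
    }
}
```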

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
