tika-dev mailing list archives

From "Mark Butler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-657) Email parser gets into trouble on malformed html in enron corpus
Date Fri, 27 May 2011 09:41:48 GMT

    [ https://issues.apache.org/jira/browse/TIKA-657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040149#comment-13040149 ]

Mark Butler commented on TIKA-657:

I took the Enron dataset and processed it using Tika and Behemoth. It contains 517,424 documents.

Using Tika 0.9 I encountered runtime errors on 27,224 documents. Sorting the exceptions, there
were four distinct stack traces; a summary is enclosed below. However, I did not see the
Tagsoup parsing problems that Benson reports.

I then ran the version of Tika in head. Here I encountered runtime errors on 1,218 documents;
a summary of these exceptions is also enclosed below. There were two sources of error. First,
the Enron corpus contains emails with lines longer than the default limit of 10,000 characters
used by RFC822Parser. Second, the corpus contains malformed dates, which cause apache-mime4j
to throw a MimeException.

The first problem is easily fixed: RFC822Parser is configured from a MimeEntityConfig
object, so passing in an object with a higher MaxLineLen (e.g. 60,000) avoids these exceptions.
I noticed that MimeEntityConfig also contains an option for "strict parsing". Currently
MailContentHandler always behaves strictly: if a MimeException is thrown while processing any
field in MailContentHandler.field(), it is passed back up and processing of the document fails.
However, we may prefer non-strict parsing, i.e. continuing even if processing one or more fields
fails. This can be achieved by placing a try/catch block around the logic inside
MailContentHandler.field() and rethrowing the exception only if strictParsing is enabled;
otherwise we just log the error.
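
To make the proposed try/catch concrete, here is a minimal, self-contained sketch of the
pattern. It is not the actual Tika patch: LenientFieldSketch, Handler and FieldParseException
are hypothetical stand-ins (FieldParseException plays the role of MimeException), the date
check is a toy placeholder for the real field parsing, and the "log" is just a list so the
example runs without any dependencies.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: stand-in types, not the real Tika or mime4j classes. */
public class LenientFieldSketch {

    /** Hypothetical stand-in for mime4j's checked MimeException. */
    static class FieldParseException extends Exception {
        FieldParseException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for MailContentHandler. */
    static class Handler {
        private final boolean strictParsing;
        final List<String> parsed = new ArrayList<>();
        final List<String> skipped = new ArrayList<>(); // stands in for a logger

        Handler(boolean strictParsing) {
            this.strictParsing = strictParsing;
        }

        /** Wraps the per-field logic; rethrows only when strict parsing is on. */
        void field(String name, String value) throws FieldParseException {
            try {
                parseField(name, value);
                parsed.add(name);
            } catch (FieldParseException e) {
                if (strictParsing) {
                    throw e; // strict: the whole document fails
                }
                // lenient: record the problem and carry on with the next field
                skipped.add(name + ": " + e.getMessage());
            }
        }

        /** Toy placeholder for real field parsing: rejects dates without a 4-digit year. */
        private void parseField(String name, String value) throws FieldParseException {
            if (name.equalsIgnoreCase("Date") && !value.matches(".*\\d{4}.*")) {
                throw new FieldParseException("malformed date");
            }
        }
    }

    public static void main(String[] args) throws FieldParseException {
        Handler lenient = new Handler(false);
        lenient.field("Subject", "Q3 forecast");
        lenient.field("Date", "not a date"); // skipped instead of failing the document
        System.out.println("parsed=" + lenient.parsed + " skipped=" + lenient.skipped);
    }
}
```

With strictParsing set to true the same malformed Date field would propagate the exception
to the caller, which matches the current behaviour described above.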

With both changes in place, I re-ran the job on the entire corpus and it parsed successfully.

> Email parser gets into trouble on malformed html in enron corpus
> ----------------------------------------------------------------
>                 Key: TIKA-657
>                 URL: https://issues.apache.org/jira/browse/TIKA-657
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Benson Margulies
>            Assignee: Julien Nioche
> There is a very large corpus of email addresses available: http://www.cs.cmu.edu/~enron/.
> In processing even a subset of this corpus, I see numerous 'unexpected RuntimeException'
> errors resulting from tagsoup throwing on truly awful html. It seems to me that being able
> to do something with this entire stack would make a good '1.0' criteria for tika's email parser.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
