Jeremy B. Merrill created TIKA-1771:
---------------------------------------
Summary: lower xhtml magic priority to ensure emails detected
as message/rfc822
Key: TIKA-1771
URL: https://issues.apache.org/jira/browse/TIKA-1771
Project: Tika
Issue Type: Improvement
Components: detector
Reporter: Jeremy B. Merrill
Priority: Critical
Some emails I have (happy to share if you want) contain XHTML as one part of a multipart message.
Prior to this pull request, the priority on the application/xhtml+xml magic detector was 50,
equal to the priority on the message/rfc822 detector. Because of the relative position of
the two detectors in tika-mimetypes.xml, the emails were incorrectly detected as XHTML documents.
With this PR, by downgrading the priority of application/xhtml+xml to 40, the more-sensitive
email magic detectors take precedence, causing the emails to be properly detected as message/rfc822.
I have not run this through the govdocs tester or anything other than my own documents, so, full
disclosure, this could cause false-negative XHTML detections elsewhere.
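
To illustrate the tie described above, here is a minimal Python sketch of priority-based magic matching. It is not Tika's code: the MagicRule class, the detect function, the byte patterns, the search windows, and the "earlier rule wins a tie" behavior are all assumptions made for illustration, chosen only to mirror the behavior reported in this issue.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MagicRule:
    """Hypothetical stand-in for one <magic> entry in tika-mimetypes.xml."""
    mime_type: str
    priority: int
    pattern: bytes
    window: int  # the pattern may start at any offset from 0 to `window`


def detect(data: bytes, rules: Sequence[MagicRule]) -> str:
    """Return the type of the highest-priority rule whose pattern matches."""
    matches = [
        r for r in rules
        if r.pattern in data[: r.window + len(r.pattern)]
    ]
    if not matches:
        return "application/octet-stream"
    # max() keeps the first maximal element, so on equal priority the rule
    # listed earlier wins (an assumption modelling the reported behavior).
    return max(matches, key=lambda r: r.priority).mime_type


def rules_with_xhtml_priority(xhtml_priority: int) -> list:
    # Patterns and windows are illustrative, not copied from Tika.
    return [
        MagicRule("application/xhtml+xml", xhtml_priority, b"<html xmlns=", 8192),
        MagicRule("message/rfc822", 50, b"Received:", 0),
    ]


# A made-up multipart email whose body contains an XHTML part.
EMAIL = (
    b"Received: from mail.example.org\r\n"
    b"Content-Type: multipart/alternative; boundary=abc\r\n\r\n"
    b"--abc\r\nContent-Type: application/xhtml+xml\r\n\r\n"
    b'<html xmlns="http://www.w3.org/1999/xhtml"><body>hi</body></html>\r\n'
    b"--abc--\r\n"
)

before = detect(EMAIL, rules_with_xhtml_priority(50))  # tie at 50
after = detect(EMAIL, rules_with_xhtml_priority(40))   # xhtml lowered to 40
print(before, after)
```

In this toy model both rules match the email, so the result depends entirely on priority and, on a tie, on rule order. Lowering the XHTML rule to 40 lets the message/rfc822 rule win regardless of order.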
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)