tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (TIKA-332) Use http-equiv meta tag charset info when processing HTML documents
Date Sun, 13 Dec 2009 00:25:18 GMT

     [ https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-332.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.6
         Assignee: Jukka Zitting

Patches applied in revision 890009.

> Use http-equiv meta tag charset info when processing HTML documents
> -------------------------------------------------------------------
>
>                 Key: TIKA-332
>                 URL: https://issues.apache.org/jira/browse/TIKA-332
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 0.5
>            Reporter: Ken Krugler
>            Assignee: Jukka Zitting
>            Priority: Critical
>             Fix For: 0.6
>
>         Attachments: TIKA-332-2.patch, TIKA-332.patch
>
>
> Currently Tika doesn't use the charset info that's optionally present in HTML documents,
> via the <meta http-equiv="Content-type" content="text/html; charset=xxx"> tag.
> If the mime-type is detected as being one that's handled by the HtmlParser, then the
> first 4-8K of text should be converted from bytes to us-ascii, and then scanned using a regex
> something like:
>     private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = Pattern.compile("<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type['\"]\\s+content\\s*=\\s*['\"][^;]+;\\s*charset\\s*=\\s*([^'\"]+)['\"]", Pattern.CASE_INSENSITIVE);
> If a charset is detected, it should take precedence over a charset in the HTTP response
> headers, and (obviously) be used to convert the bytes to text before the actual parsing of the
> document begins.
> In a test I did of 100 random HTML pages, roughly 15% contained charset info in the meta
> tag that wound up being different from the detected or HTTP response header charset, so this
> is a pretty important improvement to make. Without it, Tika isn't that useful for processing
> HTML pages.
> Though the other problem is that the HtmlParser code doesn't use the CharsetDetector,
> which is another reason for lots of incorrect text. I'll file a separate issue about that.
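
The approach described in the issue can be sketched roughly as follows. This is
only an illustration, not the patch applied in revision 890009: the class name
MetaCharsetSniffer, the methods sniff and decode, and the 8192-byte prefix are
assumptions made for the example. The regex is adapted from the one quoted above
so that it matches case-insensitively and accepts either quote style at the end.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika's API.
final class MetaCharsetSniffer {

    // Assumed prefix size; the issue suggests scanning the first 4-8K.
    private static final int PREFIX_LENGTH = 8192;

    // Adapted from the pattern in the issue: case-insensitive, and the
    // closing quote may be single or double.
    private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = Pattern.compile(
            "<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type\\s*['\"]"
            + "\\s+content\\s*=\\s*['\"][^;'\"]+;\\s*charset\\s*=\\s*"
            + "([^'\"\\s]+)\\s*['\"]",
            Pattern.CASE_INSENSITIVE);

    private MetaCharsetSniffer() {
    }

    // Returns the charset name declared in the meta tag, or null if none
    // is found in the leading bytes of the document.
    static String sniff(byte[] content) {
        int length = Math.min(content.length, PREFIX_LENGTH);
        // Decode as US-ASCII; non-ASCII bytes become replacement characters,
        // which does not matter for locating the ASCII-only meta tag.
        String prefix = new String(content, 0, length, StandardCharsets.US_ASCII);
        Matcher matcher = HTTP_EQUIV_CHARSET_PATTERN.matcher(prefix);
        return matcher.find() ? matcher.group(1) : null;
    }

    // Decodes the document with the meta tag charset when it is present and
    // supported, otherwise with the caller-supplied fallback (for example
    // the charset from the HTTP response headers).
    static String decode(byte[] content, Charset fallback) {
        Charset charset = fallback;
        String name = sniff(content);
        if (name != null) {
            try {
                charset = Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: keep the fallback.
            }
        }
        return new String(content, charset);
    }
}
```

A caller would pass the raw bytes and the charset it would otherwise have used,
for example MetaCharsetSniffer.decode(bytes, StandardCharsets.UTF_8), so that
the meta tag value wins whenever it names a supported charset.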

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

