tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2050) HTMLEncodingDetector class fails on some HTML documents
Date Mon, 08 Aug 2016 11:11:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15411675#comment-15411675 ]

Tim Allison commented on TIKA-2050:

Thank you for opening this.  It would be helpful if you summarized the problems and submitted
unit tests and/or patches.  I'm not sure the regex is at fault, but please let me know if
it is.

From what I see:
# charset is defined after 21,000 characters (well beyond our current buffer of 8192)
# two different charsets are defined (again, well beyond our current buffer).  By default,
we pick the first IIRC.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
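
To make the two observations concrete, here is a minimal sketch of a buffer-limited meta-charset scan. It is an illustration only, not the HTMLEncodingDetector code: the class name, the regex, the sample() helper and the markLimit parameter are all hypothetical. It assumes the detector only inspects the first markLimit bytes and returns the first charset it finds.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharsetSketch {

    // Hypothetical pattern (not Tika's actual regex): captures NAME from
    // charset=NAME inside a <meta ...> tag, case-insensitively.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(?is)<meta\\s[^>]*?charset\\s*=\\s*[\"']?([a-z0-9_:.\\-]+)");

    /** Returns the first charset declared within the first markLimit bytes, or null. */
    static String detect(byte[] html, int markLimit) {
        int len = Math.min(html.length, markLimit);
        // ISO-8859-1 maps every byte to exactly one char, so decoding cannot fail.
        String head = new String(html, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    /** Builds a document with `padding` filler chars before two conflicting meta tags. */
    static byte[] sample(int padding) {
        StringBuilder sb = new StringBuilder("<html><head><!-- ");
        for (int i = 0; i < padding; i++) {
            sb.append('x');
        }
        sb.append(" -->\n")
          .append("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />\n")
          .append("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\" />\n")
          .append("</head><body></body></html>");
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] html = sample(21000);
        System.out.println(detect(html, 8192));   // null: the tags lie beyond the buffer
        System.out.println(detect(html, 65536));  // utf-8: the first declaration wins
    }
}
```

With the 8192 limit the scan never reaches either tag; with a larger limit it reports utf-8 and silently ignores the later gb2312 declaration.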

Not sure there's a good way of fixing the second, but we could increase the length of the
buffer.

Do you see any other problems besides our buffer length?

> HTMLEncodingDetector class fails on some HTML documents
> -------------------------------------------------------
>                 Key: TIKA-2050
>                 URL: https://issues.apache.org/jira/browse/TIKA-2050
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: false-negative-responce-from-HTMLEncodingDetector.zip
> When [~tallison@mitre.org] and I were working on [TIKA-2038|https://issues.apache.org/jira/browse/TIKA-2038],
> I found out that the HTMLEncodingDetector class cannot extract charsets from some HTML documents.
> I’ve attached the HTML documents on which HTMLEncodingDetector fails. It seems that
> its regex should be corrected to cover these cases.

This message was sent by Atlassian JIRA
