tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2050) HTMLEncodingDetector class fails on some HTML documents
Date Wed, 10 Aug 2016 11:06:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415121#comment-15415121 ]

Tim Allison commented on TIKA-2050:

Thank you for carrying out this analysis.  It looks like we should close this ticket as "won't

> HTMLEncodingDetector class fails on some HTML documents
> -------------------------------------------------------
>                 Key: TIKA-2050
>                 URL: https://issues.apache.org/jira/browse/TIKA-2050
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: false-negative-responce-from-HTMLEncodingDetector.zip
> When [~tallison@mitre.org] and I were working on [TIKA-2038|https://issues.apache.org/jira/browse/TIKA-2038]
I found out that HTMLEncodingDetector class cannot extract charsets from some HTML documents.
I’ve attached the HTML documents on which HTMLEncodingDetector fails. It seems that
its regex should be corrected to cover these cases.

This message was sent by Atlassian JIRA
