tika-dev mailing list archives

From "Shabanali Faghani (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2050) HTMLEncodingDetector class fails on some HTML documents
Date Wed, 10 Aug 2016 20:57:21 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415997#comment-15415997 ]

Shabanali Faghani commented on TIKA-2050:

You're welcome. I agree to close this issue as “won’t fix”.

> HTMLEncodingDetector class fails on some HTML documents
> -------------------------------------------------------
>                 Key: TIKA-2050
>                 URL: https://issues.apache.org/jira/browse/TIKA-2050
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: false-negative-responce-from-HTMLEncodingDetector.zip
> When [~tallison@mitre.org] and I were working on [TIKA-2038|https://issues.apache.org/jira/browse/TIKA-2038],
I found that the HTMLEncodingDetector class cannot extract charsets from some HTML documents.
I’ve attached the HTML documents on which HTMLEncodingDetector fails. It seems that
its regex should be corrected to cover these cases.

This message was sent by Atlassian JIRA
