tika-dev mailing list archives

From "Shabanali Faghani (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2050) HTMLEncodingDetector class fails on some HTML documents
Date Tue, 09 Aug 2016 20:52:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414216#comment-15414216 ]

Shabanali Faghani commented on TIKA-2050:

I tested it again. In all of these docs the charset information appears at an index greater
than 8192, so the regex isn’t at fault. Since most of these indices are greater than 15,000,
I don’t think increasing the buffer size would be a good idea; however, that is a trade-off
between accuracy and efficiency.
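
To make the buffer-length effect concrete, here is a minimal sketch. It is not the code of HTMLEncodingDetector: the class name MetaCharsetScan, the method findCharset and the simplified regex are assumptions made for illustration. Only the 8192 limit and the "greater than 15,000" offsets come from the observation above.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetScan {
    // Simplified pattern, assumed for illustration; the real regex in Tika differs.
    private static final Pattern CHARSET =
            Pattern.compile("(?i)<meta[^>]*?charset\\s*=\\s*[\"']?([\\w-]+)");

    /** Returns the first charset name found within the first `limit` bytes, or null. */
    static String findCharset(byte[] html, int limit) {
        int n = Math.min(html.length, limit);
        // ISO-8859-1 maps every byte to exactly one char, so byte and char offsets agree.
        String head = new String(html, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Synthetic document: a long script pushes the meta tag past offset 15,000.
        StringBuilder sb = new StringBuilder("<html><head><script>");
        while (sb.length() < 15000) {
            sb.append("var x = 1; ");
        }
        sb.append("</script><meta charset=\"GBK\"></head></html>");
        byte[] doc = sb.toString().getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(findCharset(doc, 8192));   // meta tag lies outside the window
        System.out.println(findCharset(doc, 32768));  // meta tag lies inside the window
    }
}
```

With the smaller window the scan never reaches the meta tag, which matches what I see on the attached docs; the larger window finds it at the cost of reading and matching more bytes for every document.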
For the GBK docs that have more than one charset in their Meta tags, I tested them with
Chrome and Firefox. I found that Chrome has a high level of self-confidence, because it doesn’t
use the charset information at all, but at least in these cases its confidence doesn’t help
it, and it detects these docs as Western (Windows-1252). Firefox, on the other hand, extracts
the first charset that appears in the Meta tags and uses it to decode the page. Hence, it seems
that selecting the first charset is a kind of best practice in this context. Nevertheless,
we know that this method fails in some cases, such as the attached GBK docs. Maybe
extracting all charsets from the Meta tags and then selecting the one with the lowest
popularity/usage would be a better solution.
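
The "least popular charset" idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the class LeastPopularCharset, its methods, the simplified regex and above all the POPULARITY ranking are hypothetical placeholders, not measured usage data and not existing Tika code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LeastPopularCharset {
    // Simplified pattern, assumed for illustration; the real regex in Tika differs.
    private static final Pattern CHARSET =
            Pattern.compile("(?i)<meta[^>]*?charset\\s*=\\s*[\"']?([\\w-]+)");

    // Hypothetical ranking from most to least common; placeholder values only.
    private static final List<String> POPULARITY =
            Arrays.asList("utf-8", "iso-8859-1", "windows-1252", "gb2312", "gbk");

    /** Collects every charset name declared in a meta tag, in document order. */
    static List<String> allCharsets(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = CHARSET.matcher(html);
        while (m.find()) {
            found.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return found;
    }

    /** Picks the declared charset with the lowest assumed popularity, or null if none. */
    static String pick(String html) {
        String best = null;
        int bestRank = -1;
        for (String cs : allCharsets(html)) {
            int rank = POPULARITY.indexOf(cs);
            if (rank < 0) {
                rank = POPULARITY.size(); // names missing from the list count as rarest
            }
            if (rank > bestRank) {        // strict comparison: the first one wins a tie
                bestRank = rank;
                best = cs;
            }
        }
        return best;
    }
}
```

The reasoning behind the heuristic is that a very common charset is more likely to have been left in by a template or editor default, while a rarer one is more likely to have been declared deliberately. Whether that holds in general would need to be checked against a real corpus.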

bq. Do you see any other problems besides our buffer length?
I also suspected that the charsets that appear in Script tags might be false positives
of this class, but when I checked its regex I found that this isn’t the case. I don’t see
any other problem; I would just like to say that your regex approach in this class is ~18x
faster than my DOM-tree navigation approach!

I’m not sure, but our approach in TIKA-2038 is probably even more accurate than Meta detection,
and also more accurate than the algorithms of some browsers! (To test this, remove the charset
information from the meta tags of some docs, e.g. Windows-1256, GBK, …, and then open them in
a few browsers.) So I think that even if the HTMLEncodingDetector class can’t extract the
existing charsets from Meta tags, we shouldn’t worry; that isn't that important.
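
For anyone who wants to repeat the browser test, the meta charset declarations can be stripped with something like the sketch below. The class StripMetaCharset and its regex are assumptions for illustration only; the sketch removes every meta tag that mentions "charset" and leaves all other bytes untouched.

```java
import java.nio.charset.StandardCharsets;

public class StripMetaCharset {
    /** Returns a copy of the document with all meta tags that mention a charset removed. */
    static byte[] strip(byte[] html) {
        // ISO-8859-1 round-trips every byte value, so non-ASCII content is preserved as-is.
        String s = new String(html, StandardCharsets.ISO_8859_1);
        String out = s.replaceAll("(?i)<meta\\b[^>]*charset[^>]*>", "");
        return out.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

The stripped file can then be opened in a browser, or passed to a detector, to see which encoding is chosen when no Meta hint is available.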

> HTMLEncodingDetector class fails on some HTML documents
> -------------------------------------------------------
>                 Key: TIKA-2050
>                 URL: https://issues.apache.org/jira/browse/TIKA-2050
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: false-negative-responce-from-HTMLEncodingDetector.zip
> When [~tallison@mitre.org] and I were working on [TIKA-2038|https://issues.apache.org/jira/browse/TIKA-2038],
I found that the HTMLEncodingDetector class cannot extract charsets from some HTML documents.
I’ve attached the HTML documents that HTMLEncodingDetector fails on. It seems that
its regex should be corrected to cover these cases.

This message was sent by Atlassian JIRA
