tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents
Date Mon, 01 Aug 2016 12:52:21 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2038:
------------------------------
    Attachment: iust_encodings.zip

This attachment includes the encodings as detected by: 1) Tika default, 2) the HTML detector
alone, 3) UniversalCharDet alone, 4) ICU4J alone
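
For readers unfamiliar with what a meta-based HTML detector does, here is a minimal Python
sketch of reading the charset declared in a page's <meta> tag. It is not Tika's HTML detector:
the name `declared_charset`, the regular expression, and the 8192-byte scan limit are all
assumptions made for illustration.

```python
import re

# Hypothetical helper, NOT Tika's implementation: look for a charset
# declared inside a <meta> tag near the start of the document.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def declared_charset(raw: bytes, limit: int = 8192):
    """Return the charset named in a <meta> tag, or None if none is found.

    Only the first `limit` bytes are scanned (an assumed cutoff).
    """
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii") if match else None
```

A page that declares nothing yields `None`, which corresponds to a file for which no
encoding could be extracted from the markup.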

In this set, there are only 77 files for which the HTML detector cannot extract an encoding.
If we assume that the HTML meta header is most often correct and use it as "ground truth"
(with caveats!), we see the following when comparing it to the other two detectors.

Comparisons of UniversalCharDet to the HTMLDetector:
||HTMLDetector||UniversalEncodingDetector||Count||
|UTF-8|windows-1252|437|
|windows-1256|windows-1252|340|
|GBK|GB18030|320|
|windows-1256|x-MacCyrillic|159|
|GB2312|GB18030|77|
|windows-1256|NULL|34|
|windows-1256|ISO-8859-1|22|
|windows-1256|ISO-8859-5|17|
|windows-1256|KOI8-R|16|
|UTF-8|ISO-8859-1|16|
|Shift_JIS|NULL|5|
|windows-1252|x-MacCyrillic|5|
|windows-1251|x-MacCyrillic|4|
|GBK|windows-1252|3|
|windows-1256|windows-1255|3|
|Shift_JIS|windows-1252|2|
|UTF-8|GB18030|2|
|windows-1256|UTF-8|2|
|ISO-8859-1|x-MacCyrillic|1|
|UTF-8|windows-1251|1|
|windows-1256|IBM866|1|
|HTMLDetector|UniversalEncodingDetector|1|
|ISO-8859-1|windows-1252|1|
|windows-1256|ISO-8859-7|1|
|windows-1256|ISO-8859-8|1|

Comparisons of ICU4J to the HTMLDetector:
||HTMLDetector||ICU4J||Count||
|UTF-8|ISO-8859-1|465|
|windows-1256|ISO-8859-1|397|
|GBK|GB18030|314|
|windows-1251|ISO-8859-1|232|
|GB2312|GB18030|77|
|windows-1256|windows-1252|10|
|windows-1252|ISO-8859-1|7|
|GBK|ISO-8859-1|7|
|windows-1251|windows-1252|3|
|ISO-8859-1|windows-1252|2|
|UTF-8|GB18030|2|
|windows-1256|ISO-8859-2|2|
|windows-1256|UTF-16LE|1|
|ISO-8859-1|windows-1256|1|
|windows-1256|Big5|1|
|windows-1256|ISO-8859-9|1|
|GBK|windows-1252|1|
|GBK|EUC-KR|1|
|HTMLDetector|Icu4jEncodingDetector|1|
|UTF-8|windows-1252|1|
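
Tables like the two above can be produced by tallying, per file, the pair (HTML detector
result, other detector result). The Python sketch below shows one way to do that. It is an
illustration only, not the script used for this attachment; the function name
`disagreement_table`, the case-insensitive comparison, and the choice to skip files with no
HTML-detected encoding are assumptions.

```python
from collections import Counter


def disagreement_table(rows, reference="HTMLDetector", other="ICU4J"):
    """Build a JIRA-markup table of disagreeing charset pairs.

    rows: iterable of (reference_charset, other_charset), one pair per file,
    with None where a detector returned nothing.
    """
    counts = Counter()
    for ref, oth in rows:
        if ref is None:
            continue  # assumption: no reference label, so nothing to compare
        if oth is not None and ref.lower() == oth.lower():
            continue  # detectors agree (assumed case-insensitive match)
        counts[(ref, oth if oth is not None else "NULL")] += 1

    lines = [f"||{reference}||{other}||Count||"]
    # most_common() sorts by count, highest first
    for (ref, oth), n in counts.most_common():
        lines.append(f"|{ref}|{oth}|{n}|")
    return "\n".join(lines)
```

Note that a strict name comparison counts pairs such as GBK vs. GB18030 as disagreements,
even though GB18030 is a superset of GBK.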



> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: iust_encodings.zip, tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j to detect the charset encoding of HTML documents as well as
other naturally textual documents. But the accuracy of encoding detection tools, including
icu4j, is meaningfully lower for HTML documents than for other text
documents. Hence, in our project I developed a library that works pretty well for HTML documents,
which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch,
Lucene, Solr, etc., and these projects deal heavily with HTML documents,
it seems that having such a facility in Tika will also help them become more accurate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
