tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents
Date Fri, 22 Jul 2016 13:12:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389466#comment-15389466
] 

Tim Allison commented on TIKA-2038:
-----------------------------------

This is great!  I've been wanting to add stripping of HTML markup because I also found that
the markup confuses icu4j.

See a comparison on our regression corpus [here|http://162.242.228.174/encoding_detection/].
ICU4j generally does better than mozilla, but we were getting quite a few incorrect Big5
results from ICU4j where mozilla reported windows-1252/ISO-8859-1.

Our current algorithm is to run the following in order.  The first one with a non-null answer
is the encoding we choose:
{noformat}
org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector
{noformat}
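
As a rough illustration of that "first non-null answer wins" ordering, here is a minimal sketch. It does not use Tika's actual classes: the {{CharsetGuesser}} interface, the {{detect}} helper and the three stand-in lambdas are all hypothetical names made up for this example, and the answers they return are hard-coded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class FirstNonNullDetection {

    /** Hypothetical stand-in for a detector; returns null when it has no answer. */
    public interface CharsetGuesser {
        Charset guess(byte[] content);
    }

    /** Runs the guessers in order and returns the first non-null answer, or null if none answers. */
    public static Charset detect(List<CharsetGuesser> guessers, byte[] content) {
        for (CharsetGuesser guesser : guessers) {
            Charset answer = guesser.guess(content);
            if (answer != null) {
                return answer; // later guessers are never consulted
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] html = "<html><body>hello</body></html>".getBytes(StandardCharsets.US_ASCII);

        // Hard-coded stand-ins (assumptions), in the same order as the list above.
        CharsetGuesser metaHeader = content -> null;                       // no charset declared
        CharsetGuesser universal = content -> StandardCharsets.ISO_8859_1; // has an answer
        CharsetGuesser icu4j = content -> StandardCharsets.UTF_16;         // never reached here

        System.out.println(detect(List.of(metaHeader, universal, icu4j), html));
    }
}
```

With this ordering, a wrong answer from an earlier detector masks a correct answer from a later one, which is why the reliability of the first step matters.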

It looks like you maintain this order: check for the charset meta header first, then detect if
necessary.
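
For readers unfamiliar with the meta header step, here is a minimal sketch of what such a check could look like. This is not the code of {{HtmlEncodingDetector}} or of the proposed library; the class name, the regex and the 8192-byte scan limit are assumptions chosen for illustration only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSketch {

    // Matches charset=NAME inside a <meta ...> tag, with an optional quote before NAME.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_:.\\-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in a meta tag, or null if none is declared or the name is unusable. */
    public static Charset fromMeta(byte[] content) {
        // Only scan the start of the document (the limit is an arbitrary assumption).
        int length = Math.min(content.length, 8192);
        // ISO-8859-1 maps every byte to a char, so decoding cannot fail before the charset is known.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        Matcher matcher = META_CHARSET.matcher(head);
        if (!matcher.find()) {
            return null;
        }
        try {
            return Charset.forName(matcher.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"utf-8\"></head><body></body></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(fromMeta(page));
    }
}
```

A null result here is the signal to fall through to statistical detection. A non-null result is only what the page claims, which may or may not match the actual bytes.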

Out of curiosity, did you compare the results of your algorithm against the metaheader info?
 Do you have an estimate of how often that info is wrong?


> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>
> Currently, Tika uses icu4j to detect the charset encoding of HTML documents as well as
other text documents. But the accuracy of encoding detection tools, including
icu4j, is meaningfully lower for HTML documents than for other text
documents. Hence, in our project I developed a library that works pretty well for HTML documents,
which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch,
Lucene, Solr, etc., and these projects deal heavily with HTML documents,
it seems that having such a facility in Tika would also help them become more accurate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
