tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents
Date Wed, 03 Aug 2016 18:51:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406408#comment-15406408 ]

Tim Allison edited comment on TIKA-2038 at 8/3/16 6:51 PM:
-----------------------------------------------------------

I wrote a markup stripper that drops everything inside tags, comments, and <style> and <script>
elements.  I then compared:
# Tika's default detection algorithm
# The proposed detection algorithm
# HTMLEncodingDetector
# UniversalEncodingDetector
# UniversalEncodingDetector (on input that had been stripped)
# ICU4J
# ICU4J (on input that had been stripped)
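
For illustration only, the kind of stripping described above could be sketched as follows. This is not the stripper used for the comparison: the class name MarkupStripper and the regex-based approach are assumptions, and the sketch assumes an ASCII-compatible encoding (it would not work on UTF-16 input). It operates on raw bytes so that the text that survives can still be handed to a byte-level detector.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika.
final class MarkupStripper {
    // Matches, in order of preference: comments, whole <script> and <style>
    // elements (including their bodies), then any remaining tag.
    private static final Pattern MARKUP = Pattern.compile(
        "<!--.*?-->"
        + "|<script\\b[^>]*>.*?</script\\s*>"
        + "|<style\\b[^>]*>.*?</style\\s*>"
        + "|<[^>]*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the input with markup replaced by spaces; text bytes are left untouched. */
    static byte[] strip(byte[] html) {
        // ISO-8859-1 maps every byte to exactly one char and back, so this
        // round trip does not alter bytes whose real encoding is still unknown.
        String raw = new String(html, StandardCharsets.ISO_8859_1);
        String text = MARKUP.matcher(raw).replaceAll(" ");
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

A regex pass like this is only an approximation of real HTML parsing (for example, a ">" inside an attribute value would end a tag early), but it is enough to show why the detector then sees mostly natural-language bytes rather than CSS and JavaScript.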

After we do some more evaluation, I propose that we move to this order: 
HTMLEncodingDetector
ICU4J with added stripping

The performance of ICU4J improves dramatically if we strip the style/script content, which
is in line with the findings of [~faghani] et al.
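
The proposed order can be read as a simple fallback chain: ask the first detector, and only if it has no answer ask the second one on stripped input. The sketch below shows that control flow with plain Java types. DetectorChain, Detector and onStripped are hypothetical names, not Tika or ICU4J APIs; the real HTMLEncodingDetector and ICU4J detector would have to be adapted to the Detector shape assumed here.

```java
import java.nio.charset.Charset;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical sketch of "detector A first, then detector B on stripped input".
final class DetectorChain {
    /** A detector returns a charset when it has an answer, otherwise empty. */
    interface Detector extends Function<byte[], Optional<Charset>> {}

    /** Wraps a detector so that it only sees the input after preprocessing. */
    static Detector onStripped(UnaryOperator<byte[]> stripper, Detector inner) {
        return bytes -> inner.apply(stripper.apply(bytes));
    }

    /** Returns the first non-empty answer, trying detectors in list order. */
    static Optional<Charset> detect(byte[] input, List<Detector> detectors) {
        for (Detector d : detectors) {
            Optional<Charset> hit = d.apply(input);
            if (hit.isPresent()) {
                return hit;
            }
        }
        return Optional.empty();
    }
}
```

With this shape, the order under discussion would be expressed as a two-element list: a detector standing in for HTMLEncodingDetector, followed by onStripped(stripper, detector standing in for ICU4J).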

Let me know what you think...



> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803.xlsx, iust_encodings.zip, tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j to detect the charset encoding of HTML documents as well as
other text documents. But the accuracy of encoding detection tools, including
icu4j, is meaningfully lower for HTML documents than for other text
documents. Hence, in our project I developed a library that works pretty well for HTML documents,
which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch,
Lucene, Solr, etc., and these projects deal heavily with HTML documents,
having such a facility in Tika should also help them become more accurate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
