tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents
Date Wed, 08 Feb 2017 12:29:41 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2038:
------------------------------
    Attachment: tld_text_html.xlsx

bq. Since it seems that in this test the potential charset in the meta headers is the only thing available
to use as "ground truth", if we use Tika's HtmlEncodingDetector class
(with the META_TAG_BUFFER_SIZE field set to Integer.MAX_VALUE), then in addition to extracting
potential charsets from the meta headers, it will implicitly act as an HTML filter.

In the SQL/proposal above, the mime is what was returned in the actual HTTP headers, as recorded
by CommonCrawl.  Those values are still somewhat noisy.  Let's put off discussion of meta headers and
evaluation until we've gathered the data.
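For reference, the detector mentioned in the quote can be driven through Tika's EncodingDetector interface.  A minimal sketch, assuming tika-parsers 1.x on the classpath and a placeholder file name; note that the stock class only scans a limited prefix of the stream for the meta tag, so raising that limit as described above would mean changing the class itself:

{code:java}
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.html.HtmlEncodingDetector;

public class MetaCharsetSketch {
    public static void main(String[] args) throws Exception {
        // HtmlEncodingDetector looks for a <meta> charset declaration near the
        // start of the stream; the stream must support mark/reset.
        HtmlEncodingDetector detector = new HtmlEncodingDetector();
        try (InputStream in = new BufferedInputStream(new FileInputStream("sample.html"))) {
            Charset charset = detector.detect(in, new Metadata()); // null if no meta charset is found
            System.out.println("Detected meta charset: " + charset);
        }
    }
}
{code}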

In the attached, I applied a "dominant" language code to each country.  For countries with
multiple "dominant" languages, I used the country code itself ("in" -> "in").  This is a very
rough attempt to get decent coverage of languages.  I then calculated how many pages from each
country we'd want to collect to get roughly 50k per language.

I added your country codes and a few others.  How does this look?
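
To make the sampling arithmetic concrete, here is a toy sketch of the per-country quota calculation.  The country-to-language mapping and the even split across countries that share a language are assumptions for illustration; the attached spreadsheet is the actual source of the numbers:

{code:java}
import java.util.HashMap;
import java.util.Map;

public class TldQuotaSketch {
    public static void main(String[] args) {
        // Hypothetical mapping from country-code TLD to "dominant" language code.
        Map<String, String> tldToLang = new HashMap<>();
        tldToLang.put("de", "de");
        tldToLang.put("at", "de"); // Austria also mapped to German
        tldToLang.put("ir", "fa");
        tldToLang.put("in", "in"); // multiple dominant languages -> fall back to the country code

        int targetPerLanguage = 50_000;

        // Count how many TLDs share each language, then split the target evenly among them.
        Map<String, Integer> tldsPerLang = new HashMap<>();
        for (String lang : tldToLang.values()) {
            tldsPerLang.merge(lang, 1, Integer::sum);
        }
        for (Map.Entry<String, String> e : tldToLang.entrySet()) {
            int quota = targetPerLanguage / tldsPerLang.get(e.getValue());
            System.out.printf("collect ~%d pages from .%s (language %s)%n",
                    quota, e.getKey(), e.getValue());
        }
    }
}
{code}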


> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, iust_encodings.zip,
> lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, lang-wise-eval_source_code.zip, proposedTLDSampling.csv,
> tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html.xlsx
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents as well as
> other plain-text documents.  But the accuracy of encoding detection tools, including
> icu4j, is meaningfully lower on HTML documents than on other text
> documents.  Hence, in our project I developed a library that works pretty well for HTML documents,
> which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch,
> Lucene, Solr, etc., and these projects deal heavily with HTML documents,
> having such a facility in Tika would help them become more accurate as well.
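
(For context, the icu4j detection referred to above is typically driven along these lines; a minimal sketch assuming icu4j on the classpath, with a placeholder byte array standing in for a fetched page:)

{code:java}
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class Icu4jSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder bytes standing in for a downloaded HTML page.
        byte[] htmlBytes = "<html><body>\u00e9\u00e8\u00ea</body></html>".getBytes("ISO-8859-1");

        // CharsetDetector guesses the encoding from byte statistics alone, which is
        // one reason accuracy can be lower on markup-heavy HTML, as the description argues.
        CharsetDetector detector = new CharsetDetector();
        detector.setText(htmlBytes);
        CharsetMatch match = detector.detect();
        System.out.println("best guess: " + match.getName()
                + " (confidence " + match.getConfidence() + ")");
    }
}
{code}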



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
