From "Shabanali Faghani (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents
Date Tue, 07 Feb 2017 23:43:41 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857043#comment-15857043 ]

Shabanali Faghani commented on TIKA-2038:

bq. I recognize that the mime types returned by the server are not necessarily correct, but
this data might be useful.
Five years ago, when I was a novice Java developer, I worked with mime types for a while, and
I know they are unreliable. Hence, I’m very wary of using them to separate out HTML documents.
In this regard I suggest “an arrow with two targets” (a Persian proverb)! It seems that
in this test the charset declared in the meta headers, if any, is the only thing available that we can
use as “ground truth”. So, if we use Tika's [HtmlEncodingDetector|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java]
class (with its [META_TAG_BUFFER_SIZE|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java#L42]
field set to Integer.MAX_VALUE), then in addition to extracting potential charsets from the meta
headers, it will implicitly act as an HTML filter.

I think we must throw away documents with multiple charsets in their meta headers (see TIKA-2050).
This way we can also get rid of rss/feed documents whose mime type is set to html (we
had some trouble with these documents in our project years ago).
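
To make the filtering rule above concrete, here is a minimal sketch in plain Java. It is not Tika's HtmlEncodingDetector and does not reproduce its logic; the class name MetaCharsetFilter, its method names, and the two regular expressions are hypothetical and only illustrate the idea: scan the whole document for charsets declared in meta tags, then keep the document only when exactly one distinct charset is declared. Documents with no meta charset at all (which, as an assumption of this sketch, includes typical rss/feed documents) are dropped as well.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Tika. */
final class MetaCharsetFilter {

    // Simplified match for a whole <meta ...> tag (assumes no '>' inside attribute values).
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Covers both <meta charset="x"> and content="text/html; charset=x".
    private static final Pattern CHARSET_ATTR =
            Pattern.compile("charset\\s*=\\s*[\"']?\\s*([A-Za-z0-9_.:\\-]+)",
                    Pattern.CASE_INSENSITIVE);

    /** Returns the distinct charset names declared in meta tags, lower-cased. */
    static Set<String> declaredCharsets(byte[] rawHtml) {
        // ISO-8859-1 maps every byte to one char, so ASCII markup stays readable
        // whatever the real encoding is (this does not hold for UTF-16/UTF-32 input).
        String text = new String(rawHtml, StandardCharsets.ISO_8859_1);
        Set<String> found = new LinkedHashSet<>();
        Matcher tag = META_TAG.matcher(text);
        while (tag.find()) {
            Matcher charset = CHARSET_ATTR.matcher(tag.group());
            if (charset.find()) {
                found.add(charset.group(1).toLowerCase(Locale.ROOT));
            }
        }
        return found;
    }

    /** Keep a document only if its meta tags declare exactly one distinct charset. */
    static boolean keepForEvaluation(byte[] rawHtml) {
        return declaredCharsets(rawHtml).size() == 1;
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"UTF-8\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(declaredCharsets(page) + " keep=" + keepForEvaluation(page));
    }
}
```

Scanning the whole document rather than only a leading buffer mirrors the suggestion of setting META_TAG_BUFFER_SIZE to Integer.MAX_VALUE; in a real run the extraction step would be done by HtmlEncodingDetector itself, and only the "exactly one charset" check would remain.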

bq. If the goal is to get ~30k per tld, let's sample to obtain 50k on the theory that there
are duplicates and other reasons for failure.
I think it would be better to use the idea in [this post|https://issues.apache.org/jira/browse/TIKA-2038?focusedCommentId=15422448&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15422448]
for sampling. I will try to describe the idea in detail in the next few days.

bq. Any other tlds or mime defs we should add?
I suggest adding *.mx* (Mexico), *.co* (Colombia), and *.ar* (Argentina) in addition to *.es*
for Spanish (the 2nd-ranked language by [native speakers|https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers]).
There is also no tld in your list for Portuguese, so I suggest adding *.br* (Brazil) and *.pt*
(Portugal). *.id* (Indonesia), *.my* (Malaysia), *.nl* (Netherlands), … are some other important

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, iust_encodings.zip,
lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, lang-wise-eval_source_code.zip, proposedTLDSampling.csv,
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents as well as
other naturally text documents. But the accuracy of encoding detection tools, including
icu4j, on HTML documents is meaningfully lower than on other text
documents. Hence, in our project I developed a library that works pretty well for HTML documents,
which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch,
Lucene, Solr, etc., and these projects deal heavily with HTML documents,
it seems that having such a facility in Tika would also help them become more accurate.

