tika-dev mailing list archives

From "Shabanali Faghani (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents
Date Thu, 04 Aug 2016 10:09:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15407514#comment-15407514 ]

Shabanali Faghani commented on TIKA-2038:
-----------------------------------------

No. Maybe you’ve already found the answer to this question in my recent comment; anyway, for
more clarification…

*First corpus:* Since there was no benchmark in this context, I wrote a simple multi-threaded
crawler to collect a fairly small one. As the validity measure I used the charset information
that is available in the HTTP header for almost half of the HTML pages. In fact, the crawled
pages that had charset information in their HTTP header were placed in the *corpus* directory
under subdirectories named by that charset, e.g. GBK, Windows-1251, etc. (almost half of all
the pages my crawler requested); the other half were simply ignored. Since almost all HTML
pages whose HTTP servers tell clients their charset also declare a charset in their Meta tags,
almost all docs in the first corpus carry this information too, though the two values are not
necessarily the same!
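
Just to illustrate the categorization described above, a minimal sketch of how a fetched page
could be bucketed by the charset declared in its HTTP Content-Type header might look like the
following. This is not my crawler's actual code; the directory layout and helper names are
hypothetical:

{code:java}
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharsetBucketing {

  /** Fetches a page and, if its Content-Type header declares a charset,
      stores the raw bytes under corpus/<charset>/ ; otherwise the page is ignored. */
  static void fetchAndBucket(String url, Path corpusDir) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    String contentType = conn.getContentType();        // e.g. "text/html; charset=GBK"
    String charset = extractCharset(contentType);
    if (charset == null) {
      return;                                          // no charset in HTTP header -> ignored
    }
    try (InputStream in = conn.getInputStream()) {
      Path dir = corpusDir.resolve(charset);           // e.g. corpus/GBK
      Files.createDirectories(dir);
      Path file = dir.resolve(Integer.toHexString(url.hashCode()) + ".html");
      Files.copy(in, file);
    }
  }

  /** Pulls the charset token out of a Content-Type header value, if present. */
  static String extractCharset(String contentType) {
    if (contentType == null) return null;
    for (String part : contentType.split(";")) {
      part = part.trim();
      if (part.toLowerCase().startsWith("charset=")) {
        return part.substring("charset=".length()).trim().toUpperCase();
      }
    }
    return null;
  }
}
{code}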

*Second corpus:* There is no second corpus in the sense you're thinking of. It is just a collection
of 148,297 URLs extracted from the Alexa top 1 million sites, using [Top Level Domain (TLD)|https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains]
names as the selection criterion for 8 languages. These URLs are available [here|https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/test-data/language-wise]
(the last 8 files, not the directories). Again, in this evaluation we used the charset information
in the HTTP header as the validity measure/ground truth, and since this information was available
for only 85,292 URLs, the rest were ignored.
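
For illustration only, selecting the language-wise URLs from the Alexa list by TLD could be
sketched roughly like this, assuming the usual "rank,domain" CSV layout; the TLD-to-language
mapping and output file names below are placeholders, not the ones used in the repository:

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TldSelection {

  // Placeholder mapping from TLD to language bucket; the real 8-language list is in the repo.
  static final Map<String, String> TLD_TO_LANG = Map.of(
      "ir", "persian", "ru", "russian", "cn", "chinese", "jp", "japanese");

  public static void main(String[] args) throws IOException {
    // Alexa top-1M lines look like "1,google.com" (rank,domain).
    List<String> lines = Files.readAllLines(Paths.get("top-1m.csv"));
    for (Map.Entry<String, String> e : TLD_TO_LANG.entrySet()) {
      String tld = "." + e.getKey();
      List<String> urls = lines.stream()
          .map(line -> line.substring(line.indexOf(',') + 1))  // drop the rank column
          .filter(domain -> domain.endsWith(tld))              // keep domains of this TLD
          .map(domain -> "http://" + domain)
          .collect(Collectors.toList());
      Path out = Paths.get(e.getValue() + "-urls.txt");        // hypothetical output name
      Files.write(out, urls);
    }
  }
}
{code}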
Some points…
* The actual count of URLs that had charset information in their HTTP header was greater than
85,292, but due to various networking problems some of them failed to be fetched
* We didn't persist these 85,292 pages, because we didn't need them anymore after the test,
and I think their estimated aggregate size was at least ~1.7 GB (85,292 * 20 KB ≈ 1.7 GB).

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, iust_encodings.zip, tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents as well as
> other textual documents. But the accuracy of encoding detector tools, including icu4j, is
> meaningfully lower for HTML documents than for other text documents. Hence, in our project I
> developed a library that works pretty well for HTML documents, which is available here:
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within some other Apache projects such as Nutch, Lucene,
> Solr, etc., and these projects deal heavily with HTML documents, it seems that having such a
> facility in Tika would also help them become more accurate.
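
For context on the icu4j-based detection mentioned in the description, a minimal usage sketch
of icu4j's CharsetDetector (not Tika's actual integration code) looks like this:

{code:java}
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Icu4jDetection {
  public static void main(String[] args) throws Exception {
    // Raw bytes of an HTML page whose encoding is unknown.
    byte[] html = Files.readAllBytes(Paths.get(args[0]));

    CharsetDetector detector = new CharsetDetector();
    detector.setText(html);
    CharsetMatch match = detector.detect();   // best guess over icu4j's candidate charsets

    System.out.println(match.getName() + " (confidence " + match.getConfidence() + ")");
  }
}
{code}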



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
