tika-dev mailing list archives

From "Shabanali Faghani (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents
Date Mon, 01 Aug 2016 08:40:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401715#comment-15401715 ]

Shabanali Faghani commented on TIKA-2038:
-----------------------------------------

OK, so to give this community more details about my library, and also in response to your
concerns, I would say:

1) You are right, my repo on GitHub is fairly new (less than 1 year old), but its algorithm is
not new. I developed this library 4 years ago for use in a large-scale project… and it has
worked well from then until now, handling a load of ~1.2 billion pages at peak. The bug I fixed
last week was just a small mistake introduced while refactoring the code before the first release.

2) Since accuracy was much more important than performance for us, I haven’t done a
thorough performance test. Nevertheless, below are the results of a small test run on my
laptop (Intel Core i3, Java 6, Xmx: default, not relevant here):
||Subdirectory||#docs||Total Size (KB)||Average Size (KB)||Detection Time (ms)||Average Time (ms)||
|UTF-8|657|32,216|49|26658|40|
|Windows-1251|314|30,941|99|4423|14|
|GBK|419|43,374|104|20317|48|
|Windows-1256|645|66,592|103|9451|14|
|Shift_JIS|640|25,973|41|7617|11|
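
A benchmark like this boils down to a per-directory timing loop. A minimal sketch of such a
harness is below (hypothetical, not my exact code; detectCharset() is just a placeholder for
whichever detector is under test):

{code:java}
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DetectionTimer {

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args[0]);              // e.g. the "UTF-8" or "GBK" test subdirectory
        long totalMillis = 0, totalBytes = 0;
        int docs = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                byte[] raw = Files.readAllBytes(file);
                long start = System.currentTimeMillis();
                detectCharset(raw);                 // the detector under test
                totalMillis += System.currentTimeMillis() - start;
                totalBytes += raw.length;
                docs++;
            }
        }
        System.out.printf("#docs=%d, total=%d ms, avg=%d ms, avg size=%d KB%n",
                docs, totalMillis, totalMillis / docs, (totalBytes / docs) / 1024);
    }

    // Placeholder: plug in IUST-HTMLCharDet, Tika's detector, ICU4J, etc.
    static String detectCharset(byte[] raw) {
        return null;
    }
}
{code}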

Let’s take a closer look at these results. Due to the logic of my algorithm, for the first row
of this table, i.e. UTF-8, only Mozilla JCharDet was used (no JSoup and no ICU4J). But as you
can see, the required time is greater than in three of the four other cases, for which the
documents were parsed with JSoup and both JCharDet and ICU4J were involved in the detection
process. It means that if the encoding of a page is UTF-8, the time needed for a positive
response from Mozilla JCharDet is often greater than the total time needed to …
* get a negative response from Mozilla JCharDet,
* decode the input byte array using “ISO-8859-1”,
* parse that document and build a DOM tree,
* extract the text from the DOM tree,
* encode the extracted text using “ISO-8859-1”,
* and detect its encoding using ICU4J
… when the encoding of a page is not UTF-8! In brief, 40 > 14, 11, … in the table above.
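
To make those steps concrete, here is a minimal Java sketch of that flow, assuming the standard
jchardet (nsDetector), JSoup, and ICU4J APIs; it illustrates the pipeline described above rather
than the exact code of IUST-HTMLCharDet:

{code:java}
import java.nio.charset.StandardCharsets;

import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;
import org.mozilla.intl.chardet.nsPSMDetector;
import org.jsoup.Jsoup;
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class HtmlCharsetSketch {

    /** Step 0: ask Mozilla JCharDet whether the raw bytes already look like UTF-8. */
    static boolean looksLikeUtf8(byte[] raw) {
        final String[] reported = {null};
        nsDetector det = new nsDetector(nsPSMDetector.ALL);
        det.Init(new nsICharsetDetectionObserver() {
            public void Notify(String charset) {
                reported[0] = charset;              // fired when jchardet is confident
            }
        });
        det.DoIt(raw, raw.length, false);
        det.DataEnd();
        return "UTF-8".equalsIgnoreCase(reported[0]);
    }

    /** Fallback path: decode with ISO-8859-1, strip markup via JSoup, re-encode, then ICU4J. */
    static String detect(byte[] raw) {
        if (looksLikeUtf8(raw)) {
            return "UTF-8";                                             // fast path: no JSoup, no ICU4J
        }
        String latin1 = new String(raw, StandardCharsets.ISO_8859_1);   // lossless byte-to-char mapping
        String visibleText = Jsoup.parse(latin1).text();                // drop tags, scripts, CSS
        byte[] textBytes = visibleText.getBytes(StandardCharsets.ISO_8859_1);
        CharsetDetector icu = new CharsetDetector();
        icu.setText(textBytes);
        CharsetMatch best = icu.detect();
        return best != null ? best.getName() : "UTF-8";
    }
}
{code}

In this flow a UTF-8 page never reaches JSoup or ICU4J, which is why the first row of the table
spends all of its time inside JCharDet.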

Now let’s have a look at the [distribution of character encodings for websites|https://w3techs.com/technologies/history_overview/character_encoding].
Since ~87% of all websites use UTF-8, if we compute a weighted average time for detecting the
encoding of an arbitrary HTML document, I think we would get a similar estimate for both
IUST-HTMLCharDet and Tika-EncodingDetector, because that estimate is dominated by Mozilla
JCharDet, which both algorithms use in a similar way. (With the numbers above, for example,
roughly 0.87 × 40 ms + 0.13 × ~22 ms ≈ 38 ms, i.e. the UTF-8 path dominates.) Nevertheless,
for performance optimization I will run some tests on …
* using a regex instead of navigating the DOM tree to find charsets in meta tags (see the sketch after this list)
* stripping HTML markup, scripts, and embedded CSS directly instead of using an HTML parser
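
As a first cut at the regex idea, something like the following could replace the DOM walk for
meta tags (a rough sketch, not the library’s actual code; the pattern is simplified and would
need tuning for edge cases):

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetRegex {

    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]+charset\\s*=\\s*[\"']?\\s*([\\w-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Scan the leniently decoded markup directly instead of building a DOM tree. */
    static String findMetaCharset(byte[] raw) {
        String markup = new String(raw, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(markup);
        return m.find() ? m.group(1) : null;
    }
}
{code}

Whether this actually beats the JSoup DOM walk is exactly what those planned tests should show.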

3) Regarding the accuracy of Tika’s legacy method, I’ve added a comment below your current
evaluation results. As I explained there, the results of your current evaluation can’t be
directly compared with my evaluation.

bq. Perhaps we could add some code to do that?
Of course, but in my experience, when I use open-source code in my projects, versioning and
update considerations mean I don’t move my code into it unless there is no other suitable
option. Pulling other code into my projects, though, is *another story*! :)

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses ICU4J for detecting the charset encoding of HTML documents as well as
> other plain-text documents. But the accuracy of encoding detection tools, including ICU4J,
> on HTML documents is meaningfully lower than on other text documents. Hence, in our project
> I developed a library that works pretty well for HTML documents, which is available here:
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch, Lucene, Solr,
> etc., and these projects deal heavily with HTML documents, it seems that having such a
> facility in Tika would help them become more accurate as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
