tika-dev mailing list archives

From "Shabanali Faghani (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents
Date Sat, 04 Mar 2017 18:32:45 GMT


Shabanali Faghani edited comment on TIKA-2038 at 3/4/17 6:31 PM:
-----------------------------------------------------------------

Perfect reply, [~tallison@mitre.org]. Thank you!
 
bq. The current version of the stripper leaves in <meta > headers if they also include
"charset". … I included the output of the stripped HTMLMeta detector as a sanity check …
(/)
 
bq. I figure that we'll be modifying the stripper …
 
We might need the stripper to work like a SAX parser, i.e. the input should be an _InputStream_.
This is required if we decide to be conservative about OOM errors or to avoid wasting resources
on big HTML files. I know that writing a perfect _html stream stripper_ with minimal
faults (false negatives/positives, exceptions, …) is very hard. As a SAX parser, TagSoup should
be able to do this, but there are two problems: _chicken and egg_ and _performance_.
The former can be solved by the _ISO-8859-1 encoding-decoding_ trick, but there is no solution
for the latter.
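The _ISO-8859-1 encoding-decoding_ trick above can be sketched as follows. Since ISO-8859-1 maps every byte value to exactly one char, decoding and re-encoding is lossless, so markup can be stripped as text before the charset is known and the surviving bytes handed to the detector unchanged. This is only an illustration (the regex-based tag removal is a stand-in for a real stream stripper, and the class name is mine), not IUST's actual code:

```java
import java.nio.charset.StandardCharsets;

public class Latin1StripSketch {

    // Decode the raw bytes as ISO-8859-1: every byte maps to exactly one
    // char, so the round-trip byte -> char -> byte is lossless.
    static byte[] stripMarkup(byte[] rawHtml) {
        String latin1 = new String(rawHtml, StandardCharsets.ISO_8859_1);
        // Naive tag removal, for illustration only; a real stream stripper
        // must also handle script/style blocks, comments, and broken markup.
        String textOnly = latin1.replaceAll("<[^>]*>", " ");
        // Re-encode with ISO-8859-1 to recover the original text bytes,
        // which can then be fed to JCharDet/ICU4j for detection.
        return textOnly.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] html = "<html><body>caf\u00e9</body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        // prints: café  (the 0xE9 byte survives the round-trip untouched)
        System.out.println(new String(stripMarkup(html),
                StandardCharsets.ISO_8859_1).trim());
    }
}
```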

For a lightweight SAX-style stripper, I think we can ask [Jonathan Hedley| https://jhy.io/],
the author of Jsoup, or someone else on Jsoup’s mailing list whether they have ever done
something like this or could help us. We may also suggest/introduce IUST (the standalone
version) to them. IIRC, in Jsoup 1.6.1-3 (and most likely still now) the charset of a page
was assumed to be UTF-8 if the HTTP header didn’t contain any charset and none was specified
in the input.
 
bq. … and possibly IUST.
 
The current version of IUST, i.e. htmlchardet-1.0.1, uses _early-termination_ for neither JCharDet
nor ICU4j! So, we would have to write a custom version of IUST to do so. Nevertheless, I think
we can ignore this for the first version, because it shouldn’t have a meaningful effect on
the algorithm. In fact, I think calling the detection methods of JCharDet and ICU4j with an
_InputStream_ input will slightly increase efficiency at the cost of a slight decrease in accuracy.
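One way early-termination could look with an _InputStream_ is a bounded-prefix read: hand the detectors only the first N bytes instead of buffering the whole document. A minimal sketch under that assumption (the class name and the 8 KB limit are mine, not part of htmlchardet or ICU4j):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class BoundedPrefixReader {

    // Read at most maxBytes from the stream; the charset detector then
    // works on this prefix instead of the whole document, trading a little
    // accuracy for bounded memory use on big HTML files.
    static byte[] readPrefix(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int remaining = maxBytes;
        int n;
        while (remaining > 0
                && (n = in.read(chunk, 0, Math.min(chunk.length, remaining))) != -1) {
            buf.write(chunk, 0, n);
            remaining -= n;
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[100_000]);
        System.out.println(readPrefix(in, 8192).length); // prints 8192
    }
}
```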
 
bq. I didn't use IUST because this was a preliminary run, and I wasn't sure which version
I should use. The one on github or the proposed modification above or both? Let me know which
code you'd like me to run.
 
The _modified IUST_ isn’t complete yet. To complete it we must prepare a thorough list of
languages for which the stripping shouldn’t be done. These languages/TLDs are determined
by comparing the results of IUST with and without stripping. So, you should run both _htmlchardet-1.0.1.jar_
(IUST with stripping) with _lookInMeta=false_ and the class _IUSTWithoutMarkupElimination_
(IUST without stripping) from the [lang-wise-eval source code| https://issues.apache.org/jira/secure/attachment/12848364/lang-wise-eval_source_code.zip].
The accuracy of the _modified IUST_ (the pseudo code above) can then be computed algorithmically
by selecting the better of the two for each language/TLD.
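The per-TLD selection step could be a straightforward comparison of the two accuracy tables from the evaluation runs. A sketch with hypothetical names (the class, method, and tie-breaking toward stripping are my assumptions, not code from the attachments):

```java
import java.util.HashMap;
import java.util.Map;

public class StripPolicyBuilder {

    // For each TLD, enable stripping only where it does not hurt accuracy
    // (ties go to stripping, since it is the default IUST behavior).
    static Map<String, Boolean> buildPolicy(Map<String, Double> accWithStrip,
                                            Map<String, Double> accWithoutStrip) {
        Map<String, Boolean> stripByTld = new HashMap<>();
        for (Map.Entry<String, Double> e : accWithStrip.entrySet()) {
            double without = accWithoutStrip.getOrDefault(e.getKey(), 0.0);
            stripByTld.put(e.getKey(), e.getValue() >= without);
        }
        return stripByTld;
    }

    public static void main(String[] args) {
        // Illustrative numbers only, not results from the real evaluation.
        Map<String, Double> with = Map.of("ir", 0.95, "jp", 0.80);
        Map<String, Double> without = Map.of("ir", 0.90, "jp", 0.92);
        System.out.println(buildPolicy(with, without)); // e.g. {ir=true, jp=false}
    }
}
```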
 
bq. I want to focus on accuracy first. We still have to settle on an eval method. But, yes,
I do want to look at this. (/)



> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, iust_encodings.zip,
lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, lang-wise-eval_source_code.zip, proposedTLDSampling.csv,
tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html_plus_H_column.xlsx, tld_text_html.xlsx
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents, as it does
for other plain text documents. But the accuracy of encoding detector tools, including
icu4j, on HTML documents is meaningfully lower than on other text
documents. Hence, in our project I developed a library that works pretty well for HTML documents,
which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch,
Lucene, Solr, etc., and these projects deal heavily with HTML documents,
it seems that having such a facility in Tika would also help them become more accurate.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
