tika-dev mailing list archives

From "Shabanali Faghani (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents
Date Fri, 10 Feb 2017 09:16:41 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860415#comment-15860415 ]

Shabanali Faghani edited comment on TIKA-2038 at 2/10/17 9:15 AM:
------------------------------------------------------------------

In the attachment, the H column is a naive implementation of the idea I proposed earlier.
_Starvation_ and _malnutrition_ are quite obvious for some TLDs in this column, but overall it
properly reflects the distribution of HTML documents for the selected TLDs in Common Crawl.
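
The formula behind the H column is not given in this message. Purely as an illustration, the sketch below assumes a naive proportional allocation: each TLD receives a share of a fixed sample budget equal to its share of documents. The function name, the budget, and the document counts are all made up, and reading "starvation" as a zero quota and "malnutrition" as a very small quota is an assumption about the intended meaning.

```python
def proportional_quota(doc_counts, total_samples):
    """Give each TLD a sample quota proportional to its document count.

    Hypothetical helper: integer division rounds small shares down,
    which is how a rare TLD can end up with no samples at all.
    """
    total_docs = sum(doc_counts.values())
    return {tld: (total_samples * n) // total_docs
            for tld, n in doc_counts.items()}


# Made-up document counts per TLD, not Common Crawl figures.
counts = {"com": 9_000_000, "de": 900_000, "ir": 90_000, "mn": 900}
quota = proportional_quota(counts, 10_000)

for tld, q in quota.items():
    print(tld, q)
```

With these invented numbers the rarest TLD gets a quota of zero and the next rarest gets only a few dozen samples, while the overall shape still follows the document distribution. A minimum quota per TLD would be one possible way to relieve this, at the cost of distorting the proportions.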


Although it’s possible to relieve the problems of this sampling algorithm, I don’t think that
is very important, because in my evaluations the accuracy of each detector algorithm converged
to a specific value after only a small portion of each TLD had been processed. So I think that
choosing either sampling method (mine or yours) won't have a meaningful effect on the results,
although it will slightly affect the weighted aggregated results (see the + and * group bars
in the coarse-grained result diagram in the lang-wise-eval attachments).
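
To make the last point concrete, here is a minimal sketch of why the sampling method can leave per-TLD accuracy untouched yet still shift an aggregate. It assumes the aggregate is a weighted mean of per-TLD accuracies with weights taken from the sample sizes; the function name and all numbers are invented, and this is not necessarily how the + and * bars in the attachments were computed.

```python
def weighted_accuracy(accuracy, weights):
    """Weighted mean of per-TLD accuracies (hypothetical aggregation)."""
    total = sum(weights.values())
    return sum(accuracy[tld] * weights[tld] for tld in accuracy) / total


# Invented per-TLD accuracies, assumed already converged.
acc = {"com": 0.95, "ir": 0.80}

# Same accuracies, two different sampling schemes supplying the weights.
proportional = weighted_accuracy(acc, {"com": 900, "ir": 100})
uniform = weighted_accuracy(acc, {"com": 500, "ir": 500})

print(round(proportional, 3), round(uniform, 3))
```

The per-TLD numbers are identical in both calls; only the weights differ, and the aggregate moves with them.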

bq. Let's put off talk about metaheaders and evaluation until we gather the data.

Ok.

bq. I added the codes you added above and a few others. How does this look?

Looks fine to me.


was (Author: faghani):
In the attachment, the H column is a naive implementation of the idea I proposed earlier.
_Starvation_ and _malnutrition_ are quite obvious for some TLDs in this column, but overall it
properly reflects the distribution of the selected TLDs in Common Crawl.


Although it’s possible to relieve the problems of this sampling, I don’t think that is very
important, because in my evaluations the accuracy of all the detector algorithms converged
after just a few percent of each TLD had been processed. So I think that choosing either
sampling method (mine or yours) won't have a meaningful effect on the results, although it
will slightly affect the weighted aggregated results (see the + and * group bars in the
coarse-grained result in the lang-wise-eval attachments).

bq. Let's put off talk about metaheaders and evaluation until we gather the data.

Ok.

bq. I added the codes you added above and a few others. How does this look?

Looks fine to me, at least at this stage.

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, iust_encodings.zip,
lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, lang-wise-eval_source_code.zip, proposedTLDSampling.csv,
tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html_plus_H_column.xlsx, tld_text_html.xlsx
>
>
> Currently, Tika uses icu4j to detect the charset encoding of HTML documents as well as
other plain-text documents. However, the accuracy of encoding detection tools, including
icu4j, is meaningfully lower for HTML documents than for other text documents. Hence, in our
project I developed a library that works pretty well for HTML documents, which is available
here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch, Lucene,
Solr, etc., and these projects deal heavily with HTML documents, it seems that having such
a facility in Tika would also help them become more accurate.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
