tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents
Date Tue, 07 Feb 2017 13:41:42 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2038:
------------------------------
    Attachment: proposedTLDSampling.csv

I concatenated the TLDs from your initial eval (github) with the ones you mentioned in your last
post, and I added a few others for good measure.

If the goal is to get ~30k per TLD, let's sample to obtain 50k, on the theory that there will be
duplicates and other causes of failure.

Any other TLDs or mime defs we should add?

SQL to calculate these:
{noformat}
select tld, sum(n) as CountTextHTML,
     case
          when cast(50000 as float)/cast(sum(n) as float) > 1.0
          then 1.0
          else cast(50000 as float)/cast(sum(n) as float)
     end as SamplingRate
from mimes_by_tld
where tld in
('ae', 'af', 'cn', 'de', 'dz',
'eg', 'es', 'fr', 'gr', 'il',
'in', 'iq', 'ir', 'it', 'jo', 'jp',
'kp', 'kr', 'lb', 'pk', 'qa', 'ru',
'sa', 'sd', 'sy', 'tn', 'tr',
'tw', 'uk', 'us', 'vn', 'ye')
and
(mime ilike '%html%'
or mime ilike '%text%')
group by tld
{noformat}
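The capped rate in the SQL above (take everything when a TLD has fewer than 50k documents, otherwise sample down to ~50k) can be sketched outside SQL too. A minimal Python equivalent, using hypothetical per-TLD counts in place of the mimes_by_tld table:

```python
# Hypothetical per-TLD counts of text/HTML documents; the real numbers
# come from the mimes_by_tld table.
counts = {"cn": 1_200_000, "ir": 80_000, "qa": 20_000}

# Oversample past the ~30k goal to absorb duplicates and other failures.
TARGET = 50_000

def sampling_rate(n, target=TARGET):
    """Fraction of documents to sample from a TLD, capped at 1.0
    (i.e. take everything when the TLD has fewer than `target` docs)."""
    return min(1.0, target / n)

rates = {tld: sampling_rate(n) for tld, n in counts.items()}
```

With these made-up counts, "qa" gets a rate of 1.0 (fewer than 50k docs, so keep all of them) while "cn" gets 50000/1200000 ≈ 0.042.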

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, iust_encodings.zip,
lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, lang-wise-eval_source_code.zip, proposedTLDSampling.csv,
tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents as well as
other plain-text documents. But the accuracy of encoding-detection tools, including
icu4j, is meaningfully lower on HTML documents than on other text
documents. Hence, in our project I developed a library that works pretty well for HTML documents,
which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch,
Lucene, Solr, etc., and those projects work heavily with HTML documents,
having such a facility in Tika should help them become more accurate as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
