tika-dev mailing list archives

From "Tim Allison (Jira)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2940) Consider an ensemble charset detection method
Date Mon, 09 Sep 2019 11:04:00 GMT
Tim Allison created TIKA-2940:
---------------------------------

             Summary: Consider an ensemble charset detection method
                 Key: TIKA-2940
                 URL: https://issues.apache.org/jira/browse/TIKA-2940
             Project: Tika
          Issue Type: Improvement
            Reporter: Tim Allison


I recently ran our four charset detectors against our text-based files.

The raw data is available here:
http://162.242.228.174/encoding_detection/charsets_combined_201909.sql.zip (as SQL) or
http://162.242.228.174/encoding_detection/charsets_combined_201909.csv.zip (as CSV).

I've posted a preliminary/draft report here: https://github.com/tballison/share/blob/master/slides/Tika_charset_detector_study_201909.docx

In general, an ensemble approach could yield a ~1.4% increase in "common tokens"[0]
_on our corpus_.  For users with more homogeneous documents, the improvement could
be far greater (e.g. if their documents _all_ come from a content management system that
applies an incorrect HTML meta charset header).

I'm opening this issue for discussion and as encouragement for others to work with the raw
data and/or make recommendations on the preliminary report's methodology.

[0] "common tokens" in tika-eval refers to the lists we developed of the 30k most common
words for each of the 118 languages covered in tika-eval.  An increase in the total number
of "common tokens" can be a sign of improved extraction.
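
As a rough illustration of how "common tokens" could drive an ensemble, here is a minimal
Python sketch. It is not Tika's or tika-eval's implementation: the names COMMON_TOKENS,
count_common_tokens and pick_charset are hypothetical, the tiny word set stands in for a
real 30k-word list, and the candidate charsets are assumed to come from the individual
detectors. The sketch decodes the bytes with each candidate and keeps the charset whose
decoded text contains the most common tokens.

```python
import re

# Hypothetical stand-in for a tika-eval common-token list
# (the real lists hold roughly 30k words per language).
COMMON_TOKENS = {"the", "and", "für", "über", "straße"}

TOKEN_RE = re.compile(r"\w+")


def count_common_tokens(text, common=COMMON_TOKENS):
    """Count how many tokens in text appear in the common-token set."""
    return sum(1 for tok in TOKEN_RE.findall(text.lower()) if tok in common)


def pick_charset(raw, candidates, common=COMMON_TOKENS):
    """Return (charset, score) for the candidate that yields the most common tokens.

    candidates is assumed to be the list of charsets proposed by the
    individual detectors. Candidates that cannot decode the bytes, or that
    Python does not know, are skipped. Ties go to the earlier candidate.
    """
    best_charset, best_score = None, -1
    for charset in candidates:
        try:
            text = raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue
        score = count_common_tokens(text, common)
        if score > best_score:
            best_charset, best_score = charset, score
    return best_charset, best_score


if __name__ == "__main__":
    raw = "Grüße für die Straße über".encode("utf-8")
    print(pick_charset(raw, ["windows-1252", "utf-8"]))
```

A real ensemble would presumably also weigh detector confidence and language
identification; this sketch only shows the scoring idea.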



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
