tika-dev mailing list archives

From "Gaurav Gupta (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2396) Unexpected charset detected for a plain text file by CharsetDetector
Date Fri, 16 Jun 2017 10:27:00 GMT
Gaurav Gupta created TIKA-2396:
----------------------------------

             Summary: Unexpected charset detected for a plain text file by CharsetDetector
                 Key: TIKA-2396
                 URL: https://issues.apache.org/jira/browse/TIKA-2396
             Project: Tika
          Issue Type: Bug
          Components: detector, parser
            Reporter: Gaurav Gupta


Hi,

The CharsetDetector appears to rank IBM424_rtl as the most probable charset
for the attached plain text file, [^test_Asset.txt]. ISO-8859-9 has the second-best
confidence value, but ideally it should be first in the list.
Versions being used:

apache-core - 1.14.0
apache-parsers - 1.14.0
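
For reference, a minimal sketch of how the ranking can be inspected. It assumes the CharsetDetector in question is org.apache.tika.parser.txt.CharsetDetector and that the Tika parsers jar is on the classpath. The class name CharsetRanking and the default file path are hypothetical, chosen only for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

public class CharsetRanking {
    public static void main(String[] args) throws Exception {
        // Hypothetical default path; pass the real location of the attachment as args[0].
        String path = args.length > 0 ? args[0] : "test_Asset.txt";
        byte[] bytes = Files.readAllBytes(Paths.get(path));

        CharsetDetector detector = new CharsetDetector();
        detector.setText(bytes);

        // detectAll() returns every candidate, ordered from highest to lowest confidence.
        for (CharsetMatch match : detector.detectAll()) {
            System.out.println(match.getName() + " " + match.getConfidence());
        }
    }
}
```

Run against the attached file, the first line printed is the charset the detector considers most likely, which makes it easy to compare the positions of IBM424_rtl and ISO-8859-9.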

Thanks



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
