Gaurav Gupta created TIKA-2396:
----------------------------------
Summary: Unexpected charset detected for a plain text file by CharsetDetector
Key: TIKA-2396
URL: https://issues.apache.org/jira/browse/TIKA-2396
Project: Tika
Issue Type: Bug
Components: detector, parser
Reporter: Gaurav Gupta
Hi,
The CharsetDetector appears to detect the IBM424_rtl charset with the highest confidence
for the attached text file, [^test_Asset.txt]. ISO-8859-9 has the second-highest confidence
value, but ideally it should be first in the list.
Versions being used:
apache-core - 1.14.0
apache-parsers - 1.14.0
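To sanity-check which candidate is more plausible for a given file, the same bytes can be decoded under each candidate charset and the results compared. The sketch below is a rough illustration using only the JDK, not Tika's detection algorithm. The class `CharsetSanityCheck` and its methods `decodeWith` and `plausibleRatio` are hypothetical names, the sample bytes are a stand-in for the attached file, and the JDK charset name `IBM424` is assumed to be the closest equivalent of the detector's `IBM424_rtl` label.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: decodes the same bytes under
// several candidate charsets so the results can be compared.
class CharsetSanityCheck {

    // Fraction of decoded characters that are letters, digits, whitespace,
    // or printable ASCII punctuation. A crude plausibility score only.
    static double plausibleRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int plausible = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean printableAscii = c > 0x20 && c < 0x7F;
            if (Character.isLetterOrDigit(c) || Character.isWhitespace(c) || printableAscii) {
                plausible++;
            }
        }
        return (double) plausible / text.length();
    }

    // Decodes data with the named charset, or returns null when the
    // running JVM does not provide that charset.
    static String decodeWith(byte[] data, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return null;
        }
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Stand-in sample (assumption); replace with the bytes of the attached file.
        byte[] data = "Plain text sample".getBytes(StandardCharsets.US_ASCII);
        for (String name : new String[] {"ISO-8859-9", "IBM424"}) {
            String decoded = decodeWith(data, name);
            if (decoded == null) {
                System.out.println(name + ": not supported by this JVM");
            } else {
                System.out.printf("%s: ratio=%.2f text=%s%n",
                        name, plausibleRatio(decoded), decoded);
            }
        }
    }
}
```

A clearly higher ratio and readable text for ISO-8859-9 than for IBM424 would support the expectation that ISO-8859-9 should rank first, although this check is only a heuristic.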
Thanks
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)