[ https://issues.apache.org/jira/browse/TIKA-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gaurav Gupta updated TIKA-2396:
-------------------------------
Attachment: test_Asset.txt
> Unexpected charset detected for a plain text file by CharsetDetector
> --------------------------------------------------------------------
>
> Key: TIKA-2396
> URL: https://issues.apache.org/jira/browse/TIKA-2396
> Project: Tika
> Issue Type: Bug
> Components: detector, parser
> Reporter: Gaurav Gupta
> Attachments: test_Asset.txt
>
>
> Hi,
> The CharsetDetector appears to detect the wrong charset for the attached plain-text file
> [^test_Asset.txt]: it reports IBM424_rtl with the highest confidence, while ISO-8859-9,
> which should ideally be first in the list, has only the second-best confidence value.
> Versions being used:
> apache-core - 1.14.0
> apache-parsers - 1.14.0
> Thanks
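One possible caller-side workaround for the ranking described above is to skip charsets that are not expected in the input and take the best remaining candidate. The sketch below is only an illustration under stated assumptions: `Candidate`, `EXCLUDED_PREFIXES`, and `bestAllowed` are hypothetical names, and the confidence numbers are made up rather than taken from test_Asset.txt. With Tika itself, the candidate list would come from `CharsetDetector.detectAll()`, whose `CharsetMatch` results expose `getName()` and `getConfidence()`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class CharsetRanking {
    // Hypothetical stand-in for one detector result: charset name plus confidence (0-100).
    static final class Candidate {
        final String name;
        final int confidence;

        Candidate(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    // Assumption: EBCDIC code pages such as IBM420/IBM424 never occur in our input.
    static final List<String> EXCLUDED_PREFIXES = Arrays.asList("IBM420", "IBM424");

    // True if the charset name starts with one of the excluded prefixes (covers _rtl/_ltr variants).
    static boolean isExcluded(String name) {
        for (String prefix : EXCLUDED_PREFIXES) {
            if (name.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Returns the highest-confidence candidate that is not excluded, if any.
    static Optional<Candidate> bestAllowed(List<Candidate> candidates) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (isExcluded(c.name)) {
                continue;
            }
            if (best == null || c.confidence > best.confidence) {
                best = c;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        // Made-up confidences that mirror the ordering reported in this issue.
        List<Candidate> ranked = Arrays.asList(
                new Candidate("IBM424_rtl", 60),
                new Candidate("ISO-8859-9", 55),
                new Candidate("ISO-8859-1", 40));
        System.out.println(bestAllowed(ranked).map(c -> c.name).orElse("none")); // prints ISO-8859-9
    }
}
```

This only filters the result list; it does not change how the detector scores the input, so it would not address the underlying detection behaviour reported here.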
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)