[ https://issues.apache.org/jira/browse/TIKA-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13568267#comment-13568267 ]
Nick Burch commented on TIKA-1050:
----------------------------------
Charset detection generally works best if you give it a few KB of data to work on. It's all
statistics-based (n-grams), and a very short snippet generally isn't representative.
Do you have the same problem with a slightly longer block of text? If so, any chance you could
upload a new sample file of around 2-3 KB that we could use to test with?
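As a rough illustration of why the detector has to lean on statistics here: IBM866 is a single-byte code page in which every byte value maps to some character, so bytes produced by GB18030 are also structurally valid IBM866. The sketch below uses only the JDK (not Tika's CharsetDetector). The class name, helper name and sample string are made up for illustration, and it assumes the running JDK ships both the GB18030 and IBM866 charsets.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class, not part of Tika.
class CharsetAmbiguitySketch {

    // Illustrative sample only ("hello, world" in Chinese), written with
    // Unicode escapes so the source file encoding does not matter.
    static final String SAMPLE = "\u4f60\u597d\uff0c\u4e16\u754c";

    // Strict decode: returns true only if every byte sequence in 'data'
    // is well-formed and mappable in the named charset.
    static boolean decodesCleanly(byte[] data, String charsetName) {
        try {
            Charset.forName(charsetName)
                   .newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] gbBytes = SAMPLE.getBytes(Charset.forName("GB18030"));

        // A multi-byte encoding with strict structure rejects these bytes...
        System.out.println("valid as UTF-8:   " + decodesCleanly(gbBytes, "UTF-8"));
        // ...but a single-byte code page accepts them, so validity alone
        // cannot tell GB18030 and IBM866 apart.
        System.out.println("valid as GB18030: " + decodesCleanly(gbBytes, "GB18030"));
        System.out.println("valid as IBM866:  " + decodesCleanly(gbBytes, "IBM866"));
    }
}
```

Since both candidates pass a validity check, the ranking comes down to how well the byte patterns match each charset's expected frequencies, and a handful of bytes gives that comparison very little to go on.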
> Charset detection gives wrong results for GB18030 encoding
> ----------------------------------------------------------
>
> Key: TIKA-1050
> URL: https://issues.apache.org/jira/browse/TIKA-1050
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.2
> Reporter: Amit Gupta
> Priority: Critical
> Attachments: Test data-GB.txt
>
>
> CharsetDetector gives IBM866 as the charset for a text file that is in GB18030.
> GB18030 gets a lower confidence than IBM866.