tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1050) Charset detection gives wrong results for GB18030 encoding
Date Fri, 01 Feb 2013 00:13:12 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13568267#comment-13568267
] 

Nick Burch commented on TIKA-1050:
----------------------------------

Charset detection generally works best if you give it a few KB of data to work on. It's all
statistics based (n-grams), and a very short snippet generally isn't representative.

Do you have the same problem with a slightly longer block of text? If so, any chance you could
upload a new sample file of around 2-3 KB that we could use to test with?
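
To illustrate why a short snippet is hard to classify, here is a minimal sketch that uses only
JDK classes, not Tika's CharsetDetector. The class name ShortSampleAmbiguity, the helper
decodesCleanly and the four sample bytes are made up for this example, and it assumes the JDK
in use ships the GB18030 and IBM866 charsets. It checks whether the same bytes decode without
errors under each charset:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical illustration class; not part of Tika.
class ShortSampleAmbiguity {

    // Returns true if the bytes decode under the named charset with no
    // malformed or unmappable input reported by the JDK decoder.
    public static boolean decodesCleanly(byte[] data, String charsetName) {
        try {
            Charset.forName(charsetName)
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Two Chinese characters encoded as GB18030 (assumed sample, 4 bytes).
        byte[] sample = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

        for (String name : new String[] {"GB18030", "IBM866", "US-ASCII"}) {
            System.out.println(name + " decodes cleanly: " + decodesCleanly(sample, name));
        }
    }
}
```

Because a single-byte code page such as IBM866 assigns a character to every byte value, byte
validity alone cannot rule it out for this sample. The detector has to fall back on how likely
the byte patterns are, and a handful of bytes gives it very little to go on.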
                
> Charset detection gives wrong results for GB18030 encoding
> ----------------------------------------------------------
>
>                 Key: TIKA-1050
>                 URL: https://issues.apache.org/jira/browse/TIKA-1050
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2
>            Reporter: Amit Gupta
>            Priority: Critical
>         Attachments: Test data-GB.txt
>
>
> CharsetDetector gives IBM866 as the charset for a text file that is in GB18030.
> GB18030 gets a lower confidence than IBM866.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
