tika-dev mailing list archives

From "gross (JIRA)" <j...@apache.org>
Subject [jira] Updated: (TIKA-574) Support for IBM866 (CP866) encoding in TXTParser
Date Thu, 16 Dec 2010 23:21:03 GMT

     [ https://issues.apache.org/jira/browse/TIKA-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gross updated TIKA-574:
-----------------------

    Attachment: tika-0.8-cp866.patch

I reused the ngrams from cp1251 and wrote a custom byteMap. All Russian letters used in cp1251
are present in cp866, so no changes to the ngrams are needed.

I added an inner static class to CharsetRecog_sbcs and modified CharsetDetector#createRecognizers
to register this class.
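
For readers who have not opened the patch, the byteMap idea can be sketched roughly as
follows. This is not the code from tika-0.8-cp866.patch. The class and method names
(Cp866ByteMapSketch, buildByteMap) are made up for illustration, and the sketch assumes
that the JDK in use ships the IBM866 and windows-1251 charsets and that the map should
fold letters to lower case and turn everything else into a space, which is how I
understand the existing single-byte recognizers to work. Each cp866 input byte is
translated to the cp1251 byte of the same letter, which is why the cp1251 ngrams can be
reused unchanged.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical sketch: derives a 256-entry byte map that translates cp866
// input bytes into lower-case cp1251 bytes, so cp1251 ngrams can be reused.
public class Cp866ByteMapSketch {

    static byte[] buildByteMap() {
        // Assumption: both charsets are available in the running JDK.
        Charset cp866 = Charset.forName("IBM866");
        Charset cp1251 = Charset.forName("windows-1251");
        CharsetEncoder encoder = cp1251.newEncoder();

        byte[] map = new byte[256];
        for (int b = 0; b < 256; b++) {
            // cp866 is a single-byte charset, so one byte decodes to one char.
            char c = new String(new byte[] {(byte) b}, cp866).charAt(0);
            char lower = Character.toLowerCase(c);
            if (Character.isLetter(c) && encoder.canEncode(lower)) {
                // Same letter, expressed as a lower-case cp1251 byte.
                map[b] = String.valueOf(lower).getBytes(cp1251)[0];
            } else {
                // Digits, punctuation, box-drawing characters etc. become a space.
                map[b] = 0x20;
            }
        }
        return map;
    }

    public static void main(String[] args) {
        byte[] map = buildByteMap();
        // Print the mapping for the upper half, where the two code pages differ.
        for (int b = 0x80; b < 0x100; b++) {
            System.out.printf("cp866 0x%02X -> cp1251 0x%02X%n", b, map[b] & 0xFF);
        }
    }
}
```

A real recognizer would more likely carry the table as a literal array, as the other
single-byte recognizers do. Generating it as above is only a convenient way to check
such a table against the JDK's own conversion tables.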


> Support for IBM866 (CP866) encoding in TXTParser
> ------------------------------------------------
>
>                 Key: TIKA-574
>                 URL: https://issues.apache.org/jira/browse/TIKA-574
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.8
>         Environment: GNU/Linux 2.6.35-23, openjdk6
>            Reporter: gross
>            Priority: Minor
>             Fix For: 0.8, 0.9, 1.0
>
>         Attachments: tika-0.8-cp866.patch
>
>
> There's no recognizer for CP866 (the DOS Russian encoding) in Tika yet.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

