tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-965) Text Detection Fails on Mostly Non-ASCII UTF-8 Files
Date Wed, 01 Aug 2012 11:31:03 GMT

     [ https://issues.apache.org/jira/browse/TIKA-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-965:

    Attachment: 0001-TIKA-965-Text-Detection-Fails-on-Mostly-Non-ASCII-UT.patch

The attached patch implements the above idea. It seems to work fine with the UTF-8 demo at
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt, though I don't know if we can
include that file in Tika as a test case. Old texts from China, the Middle East, or other
regions that use non-Latin scripts might be a good source of copyright-free test data.
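
For readers without the attachment, here is a minimal sketch of one way such a check could
work. It is not the code from the patch, and the idea referred to above is not spelled out in
this message. The class name Utf8Heuristic and both method names are hypothetical. The sketch
contrasts a naive ASCII-ratio test, which rejects mostly non-ASCII text, with a loose check
that the bytes follow the UTF-8 lead/continuation byte pattern.

```java
// Hypothetical illustration only; NOT the code from the attached patch
// and not part of the Tika API.
final class Utf8Heuristic {

    /** Naive measure: fraction of bytes that are printable ASCII or tab/LF/CR. */
    static double asciiRatio(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        int ascii = 0;
        for (byte b : data) {
            int v = b & 0xff;
            if ((v >= 0x20 && v < 0x7f) || v == '\t' || v == '\n' || v == '\r') {
                ascii++;
            }
        }
        return (double) ascii / data.length;
    }

    /**
     * Loose structural check: every byte must fit the UTF-8 lead/continuation
     * pattern. This is not full validation (overlong forms and surrogates are
     * not rejected). A sequence cut off at the end of the buffer is tolerated,
     * since detectors usually see only a prefix of the file.
     */
    static boolean looksLikeUtf8(byte[] data) {
        if (data.length == 0) {
            return false; // assumption: an empty buffer is not treated as text
        }
        int i = 0;
        while (i < data.length) {
            int lead = data[i] & 0xff;
            int continuation;
            if (lead < 0x80) {
                // ASCII: reject control bytes below 0x20 other than common whitespace
                boolean whitespace =
                        lead == '\t' || lead == '\n' || lead == '\r' || lead == '\f';
                if (lead < 0x20 && !whitespace) {
                    return false;
                }
                continuation = 0;
            } else if (lead >= 0xc2 && lead < 0xe0) {
                continuation = 1; // two-byte sequence
            } else if (lead >= 0xe0 && lead < 0xf0) {
                continuation = 2; // three-byte sequence
            } else if (lead >= 0xf0 && lead < 0xf5) {
                continuation = 3; // four-byte sequence
            } else {
                return false; // stray continuation byte or invalid lead byte
            }
            for (int k = 1; k <= continuation; k++) {
                if (i + k >= data.length) {
                    return true; // truncated final sequence; tolerate it
                }
                if ((data[i + k] & 0xc0) != 0x80) {
                    return false; // expected a continuation byte (10xxxxxx)
                }
            }
            i += continuation + 1;
        }
        return true;
    }

    public static void main(String[] args) {
        // Greek sample written with Unicode escapes so the source stays ASCII.
        String greek = "\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1 "
                + "\u03ba\u03cc\u03c3\u03bc\u03b5";
        byte[] bytes = greek.getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println("ASCII ratio: " + asciiRatio(bytes));
        System.out.println("Looks like UTF-8: " + looksLikeUtf8(bytes));
    }
}
```

In this sketch the Greek sample has a very low ASCII ratio, so a detector that only counts
ASCII bytes would treat it as binary, while the structural check still accepts it.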

bq. Are we likely to run into similar issues with other encodings besides UTF-8?

Probably, though I think the best way to deal with them is case by case, based on the concrete
issues people face. AFAICT there's no generic solution to this problem.

> Text Detection Fails on Mostly Non-ASCII UTF-8 Files
> ----------------------------------------------------
>                 Key: TIKA-965
>                 URL: https://issues.apache.org/jira/browse/TIKA-965
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.2
>            Reporter: Ray Gauss II
>         Attachments: 0001-TIKA-965-Text-Detection-Fails-on-Mostly-Non-ASCII-UT.patch
> If a file contains relatively few ASCII characters and more 8-bit UTF-8 characters, the
> TextDetector and TextStatistics classes fail to detect it as text.
