tika-dev mailing list archives

From "Ray Gauss II (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-965) Text Detection Fails on Mostly Non-ASCII UTF-8 Files
Date Wed, 01 Aug 2012 11:54:03 GMT

    [ https://issues.apache.org/jira/browse/TIKA-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426541#comment-13426541 ]

Ray Gauss II commented on TIKA-965:

I have a test file that I've gotten permission to include: [http://svn.alfresco.com/repos/alfresco-open-mirror/alfresco/HEAD/root/projects/repository/source/test-resources/quick/quick.txt]

Supporting other encodings/charsets is part of what I was trying to address with the {{Charset}} solution.
If we add more {{CharsetRecognizer}} implementations, we can easily plug them into the {{TextDetector}}
by adding their charset names to {{VALID_TEXT_CHARSETS}}.
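
As a rough illustration of the whitelist idea, here is a minimal JDK-only sketch. It is not the code from the patch: the class name, the method {{isValidTextMatch}}, and the threshold value of 30 are assumptions made for the example. Only the names {{VALID_TEXT_CHARSETS}} and {{MINIMUM_CHARSET_MATCH_CONFIDENCE}} come from the comment.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not Tika's TextDetector.
final class CharsetWhitelistSketch {
    // Assumed value for illustration; the message does not state the real threshold.
    static final int MINIMUM_CHARSET_MATCH_CONFIDENCE = 30;

    // Charset names accepted as plain text. Supporting a charset reported by a
    // newly added recognizer would mean adding its name to this set.
    static final Set<String> VALID_TEXT_CHARSETS =
            new HashSet<>(Arrays.asList("UTF-8"));

    // charsetName may be null when the detector found no match at all.
    static boolean isValidTextMatch(String charsetName, int confidence) {
        return charsetName != null
                && confidence >= MINIMUM_CHARSET_MATCH_CONFIDENCE
                && VALID_TEXT_CHARSETS.contains(charsetName);
    }
}
```

With this shape, accepting another charset is a one-line change to the set, and the detection logic itself stays untouched.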

The charset detection only kicks in if magic detection has failed and {{TextDetector}} comes
up with {{isMostlyASCII=false}}, which should be rare, so I don't think we need
to be too concerned with performance.
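
The fallback order described above can be sketched with JDK classes only. This is an illustration under stated assumptions, not the patch: a strict UTF-8 decode stands in for the {{CharsetDetector}}, and the 90% ASCII cutoff and all names in the class are hypothetical, not the rule used by {{TextStatistics}}.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the "cheap check first, charset check only on failure" order.
final class TextFallbackSketch {
    // Assumed cutoff for illustration only.
    static final double MOSTLY_ASCII_RATIO = 0.9;

    // Counts printable ASCII plus tab, LF and CR; bytes >= 0x80 are negative in Java.
    static boolean isMostlyAscii(byte[] data) {
        int ascii = 0;
        for (byte b : data) {
            if ((b >= 0x20 && b < 0x7F) || b == '\t' || b == '\n' || b == '\r') {
                ascii++;
            }
        }
        return ascii >= MOSTLY_ASCII_RATIO * data.length;
    }

    // Stand-in for charset detection: true only if the bytes decode as strict UTF-8.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    static String detect(byte[] data) {
        if (isMostlyAscii(data)) {
            return "text/plain"; // common case: the charset check never runs
        }
        // Rare case: only now pay for the more expensive check
        return isValidUtf8(data) ? "text/plain" : "application/octet-stream";
    }
}
```

In this sketch a mostly Cyrillic UTF-8 input fails the ASCII check but still decodes cleanly, so it is reported as text, while bytes that are not valid UTF-8 fall through to the binary type.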

Here's what the relevant section in {{TextDetector}} ends up looking like:

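if (stats.isMostlyAscii()) {
    return MediaType.TEXT_PLAIN;
} else {
    // Try detecting a valid text charset
    // NOTE: the detector needs its input set via setText(...) before detect();
    // that call is not shown in this excerpt
    CharsetDetector charsetDetector = new CharsetDetector();
    CharsetMatch match = charsetDetector.detect();
    // Accept only a sufficiently confident match on a whitelisted charset
    if (match != null && match.getConfidence() >= MINIMUM_CHARSET_MATCH_CONFIDENCE
            && VALID_TEXT_CHARSETS.contains(match.getName())) {
        return MediaType.TEXT_PLAIN;
    }
    // Neither mostly ASCII nor a recognized text charset: treat as binary
    return MediaType.OCTET_STREAM;
}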

It seems simple enough, but I'm happy to pursue whatever solution people want.
> Text Detection Fails on Mostly Non-ASCII UTF-8 Files
> ----------------------------------------------------
>                 Key: TIKA-965
>                 URL: https://issues.apache.org/jira/browse/TIKA-965
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.2
>            Reporter: Ray Gauss II
>         Attachments: 0001-TIKA-965-Text-Detection-Fails-on-Mostly-Non-ASCII-UT.patch
> If a file contains relatively few ASCII characters and more 8-bit UTF-8 characters, the
> TextDetector and TextStatistics classes fail to detect it as text.


