From dev-return-8281-apmail-tika-dev-archive=tika.apache.org@tika.apache.org Wed Aug 1 10:14:08 2012
Date: Wed, 1 Aug 2012 10:14:03 +0000 (UTC)
From: "Jukka Zitting (JIRA)"
To: dev@tika.apache.org
Reply-To: dev@tika.apache.org
Message-ID: <474556895.59.1343816043853.JavaMail.jiratomcat@issues-vm>
In-Reply-To: <1942804623.120986.1343735434536.JavaMail.jiratomcat@issues-vm>
Subject: [jira] [Comment Edited] (TIKA-965) Text Detection Fails on Mostly Non-ASCII UTF-8 Files
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

    [ https://issues.apache.org/jira/browse/TIKA-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426490#comment-13426490 ]

Jukka Zitting edited comment on TIKA-965 at 8/1/12 10:12 AM:
-------------------------------------------------------------

I'm not too big a fan of the {{Charset}} classes in {{o.a.t.parser.txt}}.
We borrowed them from ICU4J, and though they cover a lot of exotic corner cases, they're pretty slow and cumbersome to use with the vast majority of text out there.

An alternative that should work fairly well is to leverage the existing {{TextStatistics}} class in {{tika-core}} for a quick check of whether there are as many UTF-8 continuation bytes in the text as there should be. Something like the following might be a good approximation:

{code}
public boolean looksLikeUTF8() {
    int control = count(0, 0x20);
    int utf8 = count(0x20, 0x80);
    int safe = countSafeControl();

    int expectedContinuation = 0;
    int[] leading = new int[] {
            count(0xc0, 0xe0), count(0xe0, 0xf0), count(0xf0, 0xf8) };
    for (int i = 0; i < leading.length; i++) {
        utf8 += leading[i];
        expectedContinuation += (i + 1) * leading[i];
    }

    int continuation = count(0x80, 0xc0);
    return utf8 > 0
            && continuation <= expectedContinuation
            && continuation >= expectedContinuation - 3
            && count(0xf8, 0x100) == 0
            && (control - safe) * 100 < utf8 * 2;
}
{code}

was (Author: jukkaz):
    I'm not too big a fan of the {{Charset}} classes in {{o.a.t.parser.txt}}. We borrowed them from ICU4J, and though they cover a lot of exotic corner cases, they're pretty slow and cumbersome to use with the vast majority of text out there.

An alternative that should work fairly well is to leverage the existing {{TextStatistics}} class in {{tika-core}} for a quick check of whether there are as many UTF-8 continuation bytes in the text as there should be.
Something like the following might be a good approximation:

{code}
public boolean looksLikeUTF8() {
    int control = count(0, 0x20);
    int utf8 = count(0x20, 0x80);
    int safe = countSafeControl();

    int expectedContinuation = 0;
    int[] leading = new int[] {
            count(0xc0, 0xe0), count(0xe0, 0xf0), count(0xf0, 0xf8) };
    for (int i = 0; i < leading.length; i++) {
        utf8 += leading[i];
        expectedContinuation += (i + 1) * leading[i];
    }

    int continuation = count(0x80, 0xc0);
    return utf8 > 0
            && continuation <= expectedContinuation
            && continuation >= expectedContinuation - 3
            && count(0xf80, 0x100) == 0
            && (control - safe) * 100 < utf8 * 2;
}
{code}

> Text Detection Fails on Mostly Non-ASCII UTF-8 Files
> ----------------------------------------------------
>
>                 Key: TIKA-965
>                 URL: https://issues.apache.org/jira/browse/TIKA-965
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.2
>            Reporter: Ray Gauss II
>
> If a file contains relatively few ASCII characters and more 8 bit UTF-8 characters the TextDetector and TextStatistics classes fail to detect it as text.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
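
For anyone who wants to try the continuation-byte check from the comment outside of Tika, here is a minimal standalone sketch. It is not the {{TextStatistics}} class itself: the class name {{Utf8Heuristic}}, the {{addData}} method and the static convenience overload are hypothetical, {{count(from, to)}} is assumed to cover the half-open range from {{from}} up to but excluding {{to}}, and the set of "safe" control characters (tab, LF, CR, form feed, escape) is an assumption as well.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical standalone holder for the heuristic; not part of tika-core.
final class Utf8Heuristic {

    // Histogram of byte values seen so far.
    private final int[] counts = new int[256];

    void addData(byte[] buffer) {
        for (byte b : buffer) {
            counts[b & 0xff]++;
        }
    }

    // Assumption: counts bytes in the half-open range [from, to).
    private int count(int from, int to) {
        int n = 0;
        for (int i = from; i < to; i++) {
            n += counts[i];
        }
        return n;
    }

    // Assumption: tab, LF, CR, form feed and escape are treated as harmless.
    private int countSafeControl() {
        return counts['\t'] + counts['\n'] + counts['\r']
                + counts[0x0c] + counts[0x1b];
    }

    // Same logic as the snippet in the comment above.
    boolean looksLikeUTF8() {
        int control = count(0, 0x20);
        int utf8 = count(0x20, 0x80);            // printable ASCII
        int safe = countSafeControl();

        int expectedContinuation = 0;
        int[] leading = new int[] {
                count(0xc0, 0xe0),               // lead byte of a 2-byte sequence
                count(0xe0, 0xf0),               // lead byte of a 3-byte sequence
                count(0xf0, 0xf8) };             // lead byte of a 4-byte sequence
        for (int i = 0; i < leading.length; i++) {
            utf8 += leading[i];
            expectedContinuation += (i + 1) * leading[i];
        }

        int continuation = count(0x80, 0xc0);
        return utf8 > 0
                && continuation <= expectedContinuation
                && continuation >= expectedContinuation - 3  // tolerate a cut-off last character
                && count(0xf8, 0x100) == 0                   // bytes that never occur in UTF-8
                && (control - safe) * 100 < utf8 * 2;
    }

    // Hypothetical convenience wrapper for one-shot checks.
    static boolean looksLikeUTF8(byte[] data) {
        Utf8Heuristic h = new Utf8Heuristic();
        h.addData(data);
        return h.looksLikeUTF8();
    }

    public static void main(String[] args) {
        // Greek text, written with escapes so the source file stays ASCII.
        String greek = "\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1 "
                + "\u03ba\u03cc\u03c3\u03bc\u03b5";
        String french = "caf\u00e9 cr\u00e8me br\u00fbl\u00e9e";

        System.out.println(looksLikeUTF8(greek.getBytes(StandardCharsets.UTF_8)));
        System.out.println(looksLikeUTF8(french.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The Greek sample is almost entirely non-ASCII, which is the case described in the issue, and every lead byte is matched by exactly one continuation byte, so the check passes. The ISO-8859-1 sample has bytes in the lead-byte range with no continuation bytes following them, so it fails.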