tika-dev mailing list archives

From Christiaan Fluit <christiaan.fl...@aduna-software.com>
Subject Re: [Aperture-devel] Charset detection
Date Wed, 09 Dec 2009 20:33:26 GMT
Antoni Mylka wrote:
> I was wondering if anyone has any experience with the jchardet library
> for charset detection. Does it work? What kinds of documents does it
> actually support?
> Christiaan has posted an idea to the Aperture tracker how we could use
> jchardet to improve the plain text extractor, but it doesn't seem to
> work.  Or maybe the Tika guys have figured it out already and I can just
> use Tika for this? :)

We started using jchardet in conjunction with cpdetector to better 
support Chinese, Japanese and Korean documents in our app on all Windows 
language variants. Otherwise, whenever a UTF Byte Order Mark was missing, 
the app would have to fall back to the default platform encoding or a 
user setting. It seemed to do a pretty good job on the test files that I 
used (primarily CJK and English docs). Only recently did we find out that 
jchardet doesn't detect Cyrillic documents.
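
To make the fallback order above concrete, here is a minimal sketch in Java. It is not the code from our app, and it does not use the real jchardet or cpdetector APIs: the class name CharsetFallbackChain and the idea of wrapping each detector as a Function are assumptions made purely for illustration. Each detector gets the first bytes of the document and either names a charset or declines; the first answer wins, and the caller-supplied fallback (platform default or user setting) is used only when every detector declines.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper for illustration; not part of jchardet, cpdetector or Aperture.
final class CharsetFallbackChain {
    private final List<Function<byte[], Optional<Charset>>> detectors;
    private final Charset fallback;

    // detectors: tried in order, e.g. BOM check first, then statistical detectors.
    // fallback: used when no detector gives an answer (platform default or user setting).
    CharsetFallbackChain(List<Function<byte[], Optional<Charset>>> detectors, Charset fallback) {
        this.detectors = new ArrayList<>(detectors);
        this.fallback = fallback;
    }

    // Returns the first charset any detector reports for the given leading bytes,
    // or the fallback when every detector declines.
    Charset detect(byte[] head) {
        for (Function<byte[], Optional<Charset>> detector : detectors) {
            Optional<Charset> guess = detector.apply(head);
            if (guess.isPresent()) {
                return guess.get();
            }
        }
        return fallback;
    }
}
```

A real setup would presumably pass Charset.defaultCharset() or the user's configured charset as the fallback, and wrap the actual detector libraries in small adapter lambdas.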

It seems that the set of supported charsets in jchardet is a subset of 
those supported by Mozilla/Firefox (jchardet is supposed to be a Java 
port of the charset detection algorithm in those apps). As additional 
charsets are a matter of porting some static data structures encoded in 
C or C++ to Java, perhaps it's feasible to do that ourselves? Provided 
that the algorithm hasn't changed, of course. I have not had any contact 
with any of the jchardet developers yet.

When testing the Aperture test docs, only plain-text-utf16le.txt is no 
longer processed correctly, right? This is a cpdetector problem, not a 
jchardet problem. We already have solid code (IMHO :) ) for BOM detection 
in our existing PlainTextExtractor, so there is no need to use 
cpdetector's ByteOrderMarkDetector.
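
For readers unfamiliar with BOM detection, a rough sketch of the idea follows. This is not the PlainTextExtractor code; the class name BomSniffer is made up for this example, and only the UTF-8 and UTF-16 marks are handled (UTF-32 is left out to keep it short). It uses nothing beyond the standard Java library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical illustration of BOM sniffing; not the Aperture implementation.
final class BomSniffer {

    // Inspects the leading bytes of a document and returns the charset
    // indicated by a Byte Order Mark, or empty when no known BOM is present.
    static Optional<Charset> detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        return Optional.empty();
    }

    // Compares the first bytes of data against prefix, treating bytes as unsigned.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Note that Java's UTF-16LE and UTF-16BE decoders do not strip the mark themselves, so a caller would normally skip the BOM bytes before decoding the rest of the file.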


