tika-dev mailing list archives

From Jérôme Charron <jerome.char...@gmail.com>
Subject Re: Charset detection
Date Wed, 09 Dec 2009 15:50:48 GMT
Hi Antoni,

I tried many charset detection libraries while working on Nutch, but none of
them really worked.
I also took a look at the Mozilla charset detector, but it was
too complicated to integrate into Nutch (or Tika).

Best regards


2009/12/9 Antoni Mylka <antoni.mylka@gmail.com>

> Aperturians, Tika
> I was wondering if anyone has any experience with the jchardet library
> for charset detection. Does it work? What kinds of documents does it
> actually support?
> Christiaan has posted an idea to the Aperture tracker how we could use
> jchardet to improve the plain text extractor, but it doesn't seem to
> work.  Or maybe the Tika guys have figured it out already and I can just
> use Tika for this? :)
> Antoni Mylka
> antoni.mylka@gmail.com

Jérôme Charron
Directeur Technique @ WebPulse
Tel: +33675742890 <= ** NEW **
eMail : jerome.charron@webpulse.fr
