tika-dev mailing list archives

From Alex Ott <alex...@gmail.com>
Subject Re: Charset detection
Date Wed, 09 Dec 2009 15:55:59 GMT
Hello

In my experience, n-grams work quite well for language/charset detection
with single-byte encodings.
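
As a rough illustration of the n-gram idea (a minimal sketch, not what Tika,
Nutch, or jchardet actually do): build a byte-bigram profile from sample text
encoded in each candidate single-byte charset, then pick the charset whose
profile is closest to the unknown bytes. The function names, the Russian
sample sentence, the candidate charset list, and the cosine-similarity scoring
are all assumptions made for this example.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every overlapping run of n bytes in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(sample_text: str, charsets, n: int = 2) -> dict:
    """Encode the same training text in each charset and profile the bytes."""
    return {cs: byte_ngrams(sample_text.encode(cs), n) for cs in charsets}


def guess_charset(data: bytes, profiles: dict, n: int = 2) -> str:
    """Return the charset whose profile is most similar to the unknown bytes."""
    grams = byte_ngrams(data, n)
    return max(profiles, key=lambda cs: cosine(grams, profiles[cs]))


# Hypothetical training data: a real detector would use a large corpus
# per language rather than a couple of sentences.
SAMPLE = ("съешь же ещё этих мягких французских булок, да выпей чаю. "
          "В чащах юга жил бы цитрус? Да, но фальшивый экземпляр!")
CHARSETS = ["cp1251", "koi8_r", "iso8859_5"]

if __name__ == "__main__":
    profiles = build_profiles(SAMPLE, CHARSETS)
    unknown = "выпей ещё чаю".encode("koi8_r")
    print(guess_charset(unknown, profiles))
```

The same profiles can be keyed by (language, charset) pairs, so one pass
over the bytes gives a guess for both. This only makes sense for one-byte
encodings; multi-byte ones need a different approach.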


2009/12/9 Jérôme Charron <jerome.charron@gmail.com>:
> Hi Antoni,
>
> I tried many charset detection libraries while working on Nutch, but none of
> them really worked.
> I also took a look at the Mozilla charset detector, but it was
> really too complicated to integrate into Nutch (or Tika).
>
> Best regards
>
> Jérôme
>
> 2009/12/9 Antoni Mylka <antoni.mylka@gmail.com>
>
>> Aperturians, Tika
>>
>> I was wondering if anyone has any experience with the jchardet library
>> for charset detection. Does it work? What kinds of documents does it
>> actually support?
>>
>> Christiaan has posted an idea to the Aperture tracker on how we could use
>> jchardet to improve the plain text extractor, but it doesn't seem to
>> work. Or maybe the Tika guys have figured it out already and I can just
>> use Tika for this? :)
>>
>> Antoni Mylka
>> antoni.mylka@gmail.com
>>
>
>
>
> --
> Jérôme Charron
> Directeur Technique @ WebPulse
> Tel: +33675742890 <= ** NEW **
> eMail : jerome.charron@webpulse.fr
> http://www.webpulse.fr/
> http://www.shopreflex.com/
> http://www.staragora.com/
>



-- 
With best wishes,                    Alex Ott, MBA
http://alexott.blogspot.com/
http://alexott-ru.blogspot.com/
http://xtalk.msk.su/~ott/
