tika-dev mailing list archives

From Thilo Goetz <twgo...@gmx.de>
Subject Re: [Aperture-devel] Charset detection
Date Wed, 09 Dec 2009 16:50:12 GMT
I've had reasonable success with the ICU charset
detection, but that's the only one I've tried, so I
can't compare it to any others.

--Thilo

On 12/9/2009 17:10, darren@ontrenet.com wrote:
> Yeah, there are many uncertainties in charset detection, and no
> method identifies the charset with 100% accuracy. It's more art
> than science. That said, I will hunt around for a decent library too.
> 
>> Hi Antoni,
>>
>> I tried many charset detection libraries while working on Nutch, but none
>> of them really worked.
>> I also took a look at the Mozilla charset detector, but it was
>> really too complicated to integrate into Nutch (or Tika).
>>
>> Best regards
>>
>> Jérôme
>>
>> 2009/12/9 Antoni Mylka <antoni.mylka@gmail.com>
>>
>>> Aperturians, Tika
>>>
>>> I was wondering if anyone has any experience with the jchardet library
>>> for charset detection. Does it work? What kinds of documents does it
>>> actually support?
>>>
>>> Christiaan has posted an idea to the Aperture tracker on how we could use
>>> jchardet to improve the plain text extractor, but it doesn't seem to
>>> work. Or maybe the Tika guys have figured it out already and I can just
>>> use Tika for this? :)
>>>
>>> Antoni Mylka
>>> antoni.mylka@gmail.com
>>>
>>
>>
>>
>> --
>> Jérôme Charron
>> Directeur Technique @ WebPulse
>> Tel: +33675742890 <= ** NEW **
>> eMail : jerome.charron@webpulse.fr
>> http://www.webpulse.fr/
>> http://www.shopreflex.com/
>> http://www.staragora.com/
