tika-dev mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: AutoDetectParser is not parsing UTF-16 content types
Date Wed, 29 Aug 2012 16:24:32 GMT

On Wed, Aug 29, 2012 at 6:02 PM, chraj007 <chraj.kool@gmail.com> wrote:
> http://lucene.472066.n3.nabble.com/file/n4004078/test.html test.html

Looks like that file has an incorrect http-equiv declaration:

    <META http-equiv="Content-Type" content="text/html; charset=utf-16">

The encoding of the file is not UTF-16.

Can you file a TIKA issue about this? Tika should be able to
automatically detect the correct encoding and use it if the declared
one is obviously incorrect.
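
As a rough illustration of what "obviously incorrect" could mean here, the
sketch below checks whether a byte stream declared as UTF-16 plausibly is
UTF-16. It is not Tika's implementation: the class DeclaredCharsetCheck, its
method names, the 512-byte sample window and the 25% NUL threshold are all
assumptions made for this example. The idea is that ASCII-heavy markup encoded
as UTF-16 either starts with a byte order mark or has a NUL byte in roughly
every other position, whereas a single-byte encoding has neither.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the Tika API.
final class DeclaredCharsetCheck {

    /** Returns true if the data starts with a UTF-16 byte order mark (FE FF or FF FE). */
    static boolean hasUtf16Bom(byte[] data) {
        if (data.length < 2) {
            return false;
        }
        int b0 = data[0] & 0xFF;
        int b1 = data[1] & 0xFF;
        return (b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE);
    }

    /**
     * Heuristic check: a BOM, or a high share of NUL bytes in the first
     * 512 bytes, suggests UTF-16. The window and threshold are assumed values.
     */
    static boolean looksLikeUtf16(byte[] data) {
        if (hasUtf16Bom(data)) {
            return true;
        }
        int limit = Math.min(data.length, 512);
        if (limit < 2) {
            return false;
        }
        int zeros = 0;
        for (int i = 0; i < limit; i++) {
            if (data[i] == 0) {
                zeros++;
            }
        }
        // Treat the data as UTF-16 if at least a quarter of the sampled bytes are NUL.
        return zeros * 4 >= limit;
    }

    public static void main(String[] args) {
        String html = "<META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-16\">";
        byte[] singleByte = html.getBytes(StandardCharsets.US_ASCII);
        byte[] utf16 = html.getBytes(StandardCharsets.UTF_16LE);
        System.out.println(looksLikeUtf16(singleByte)); // false: declaration contradicts the bytes
        System.out.println(looksLikeUtf16(utf16));      // true: NUL in every other position
    }
}
```

A parser could use a check like this to fall back to statistical encoding
detection when the declared charset fails it, instead of trusting the META tag.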


Jukka Zitting
