tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: AutoDetectParser is not parsing UTF-16 content types
Date Thu, 30 Aug 2012 21:09:02 GMT

On Aug 29, 2012, at 9:24am, Jukka Zitting wrote:

> Hi,
> 
> On Wed, Aug 29, 2012 at 6:02 PM, chraj007 <chraj.kool@gmail.com> wrote:
>> http://lucene.472066.n3.nabble.com/file/n4004078/test.html test.html
> 
> Looks like that file has an incorrect http-equiv declaration:
> 
>    <META http-equiv="Content-Type" content="text/html; charset=utf-16">
> 
> The encoding of the file is not UTF-16.
> 
> Can you file a TIKA issue about this? Tika should be able to
> automatically detect the correct encoding and use it if the declared
> one is obviously incorrect.

See https://issues.apache.org/jira/browse/TIKA-539 for an existing issue that discusses the
challenges of what information to trust with charset detection.

At the time of that issue, I was in favor of a heuristic that treated the server response and
meta tags as truth (if they agreed), and otherwise fell back to statistical analysis.
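
A minimal sketch of that heuristic, using only the JDK. The class name, the method names, and the
pluggable statisticalDetector function are hypothetical stand-ins for illustration, not Tika or
ICU APIs. It also assumes that a missing or unparseable declaration counts as "no agreement":

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.function.Function;

public class CharsetHeuristic {

    // Parse a charset label; empty if the label is missing, malformed, or unsupported.
    static Optional<Charset> parse(String label) {
        if (label == null || label.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(label.trim()));
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and UnsupportedCharsetException.
            return Optional.empty();
        }
    }

    // Trust the declared charset only when the HTTP header and the meta tag agree.
    // In every other case (disagreement, or either one missing) defer to the detector.
    // statisticalDetector is a hypothetical hook for whatever detection is available.
    static Charset choose(String headerCharset,
                          String metaCharset,
                          byte[] content,
                          Function<byte[], Charset> statisticalDetector) {
        Optional<Charset> header = parse(headerCharset);
        Optional<Charset> meta = parse(metaCharset);
        if (header.isPresent() && header.equals(meta)) {
            return header.get();
        }
        return statisticalDetector.apply(content);
    }

    public static void main(String[] args) {
        byte[] content = "<html><body>hello</body></html>".getBytes(StandardCharsets.US_ASCII);

        // Placeholder detector for the demo; a real one would analyze the bytes.
        Function<byte[], Charset> detector = bytes -> StandardCharsets.ISO_8859_1;

        // Header and meta tag agree, so the declaration is used.
        System.out.println(choose("utf-8", "UTF-8", content, detector));

        // Only the meta tag is present (as in test.html), so the detector decides.
        System.out.println(choose(null, "utf-16", content, detector));
    }
}
```

With this rule, a lone incorrect meta tag like the one in test.html never overrides the detector,
but two declarations that agree on the same wrong charset still would.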

But maybe statistical analysis is now fast/accurate enough, and we should only use the meta
tag as a hint for ICU.

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr




