lucene-general mailing list archives

From Jukka Zitting
Subject Re: [VOTE] Apache Tika 0.4 Release Candidate 2
Date Wed, 15 Jul 2009 13:30:56 GMT

On Wed, Jul 15, 2009 at 3:00 PM, Grant Ingersoll wrote:
> 3. Did something change such that CONTENT_LANGUAGE is now not being set for
> HTML?  We have a test in Solr that looks for that attribute, and it was
> passing with 0.3 but is now not passing in 0.4.

This is because of TIKA-208.

We used to use the ICU4J charset detection mechanism to automatically
detect the encoding of HTML files. ICU4J would also guess the content
language based on the detected encoding (e.g. a document encoded in
KOI8-R is most likely written in Russian).

However, this mechanism wasn't as accurate as the encoding detection
already present in NekoHtml, and language detection based on the
encoding alone is often incorrect.
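
To illustrate why the encoding alone is a weak signal, here is a minimal
sketch of that kind of heuristic. It is not the ICU4J or Tika code; the
class name, the lookup table and the guess() method are all hypothetical
and exist only for this example.

```java
import java.util.Locale;
import java.util.Map;

public class EncodingLanguageGuess {
    // Hypothetical table for illustration only: a few legacy charsets
    // that are strongly associated with a single language.
    private static final Map<String, String> LANGUAGE_BY_CHARSET = Map.of(
            "KOI8-R", "ru",
            "SHIFT_JIS", "ja",
            "EUC-KR", "ko");

    /**
     * Returns an ISO 639-1 language code guessed from the charset name,
     * or null when the charset carries no language information.
     */
    static String guess(String charsetName) {
        return LANGUAGE_BY_CHARSET.get(charsetName.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // A KOI8-R document is most likely Russian.
        System.out.println("KOI8-R -> " + guess("KOI8-R"));
        // UTF-8 and ISO-8859-1 are used for many languages, so the
        // heuristic has nothing to go on.
        System.out.println("UTF-8 -> " + guess("UTF-8"));
        System.out.println("ISO-8859-1 -> " + guess("ISO-8859-1"));
    }
}
```

Any document in a multi-language encoding such as UTF-8 or ISO-8859-1
gets no answer from a lookup like this, and a heuristic that answers
anyway can only guess.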

See TIKA-209 for some ideas on how to make the language detection more
generic and accurate. For now I think it's better to ship Tika 0.4
without the earlier flawed CONTENT_LANGUAGE implementation for HTML.


Jukka Zitting
