lucene-solr-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: verifying that an index contains ONLY utf-8
Date Thu, 13 Jan 2011 21:36:29 GMT
On Thu, Jan 13, 2011 at 2:05 PM, Jonathan Rochkind <rochkind@jhu.edu> wrote:
>
> There are various packages of such heuristic algorithms to guess char
> encoding, I wouldn't try to write my own. icu4j might include such an
> algorithm, not sure.
>

it does: http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
this takes a sample of the input bytes and makes a statistical guess at the encoding.
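As a rough sketch of how that detector is used (assuming the icu4j jar is on the classpath; the sample text here is just for illustration):

```java
// Sketch: guessing the charset of a byte sample with ICU4J's CharsetDetector.
// Requires the icu4j jar on the classpath.
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class DetectExample {
    public static void main(String[] args) {
        // Some sample bytes; multi-byte UTF-8 characters give the detector evidence.
        byte[] sample = "héllo wörld, this is some sample text"
                .getBytes(java.nio.charset.StandardCharsets.UTF_8);

        CharsetDetector detector = new CharsetDetector();
        detector.setText(sample);
        CharsetMatch match = detector.detect(); // best guess, or null if none
        if (match != null) {
            System.out.println(match.getName() + " confidence=" + match.getConfidence());
        }
    }
}
```

Remember it is only a guess: the confidence score reflects how well the sample fits the detector's statistical model, not a proof of the encoding.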

also, in general keep in mind that Java CharsetDecoders tend to
silently replace or skip illegal byte sequences rather than throw exceptions.

If you instead want to be "paranoid" about these things, then rather than
opening an InputStreamReader with a Charset, open it with something like
charset.newDecoder().onMalformedInput(CodingErrorAction.REPORT).onUnmappableCharacter(CodingErrorAction.REPORT)

Then if the decoder hits an illegal state or byte sequence, it will throw
an exception instead of silently replacing the input with U+FFFD.
Of course, as Jonathan says, you cannot "confirm" that something is UTF-8.
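A minimal sketch of that strict-decoder setup, using only the JDK (the invalid byte sequence here is just an illustrative example):

```java
// Sketch: a strict UTF-8 decoder that throws on malformed input
// instead of silently substituting U+FFFD.
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    public static void main(String[] args) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        // 0xC3 opens a two-byte UTF-8 sequence, but 0x28 is not a
        // valid continuation byte, so this input is malformed.
        byte[] bad = {(byte) 0xC3, (byte) 0x28};
        try {
            strict.decode(ByteBuffer.wrap(bad));
            System.out.println("decoded ok");
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```

The same decoder can be passed to `new InputStreamReader(in, strict)` so that reading a wrongly-encoded file fails fast rather than quietly producing replacement characters.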

But many times you can "confirm" it is definitely not: see
https://issues.apache.org/jira/browse/SOLR-2003 for a practical
example of this in use; we throw an exception if we can detect that
your stopwords or synonyms file is definitely wrongly encoded.
