lucene-solr-user mailing list archives

From "Gereon Steffens" <>
Subject UTF-8 2-byte vs 4-byte encodings
Date Wed, 02 May 2007 07:58:47 GMT

I have a question regarding UTF-8 encodings, illustrated by the
utf8-example.xml file. This file contains raw, unescaped UTF-8 characters,
for example the "e acute" character, represented as the two bytes 0xC3 0xA9.
When this file is added to Solr and retrieved later, the XML output
contains a four-byte representation of that character, namely 0xC2 0x83
0xC2 0xA9.

If, on the other hand, the input data contains this same character as the
entity &#xE9;, the output contains the two-byte encoded representation 0xC3
0xA9.

Why is that so, and is there a way to always get characters like these out
of Solr as their two-byte representations?

The reason I'm asking is that I often have to deal with CDATA sections in
my input files that contain raw (two-byte) UTF-8 characters that can't be
encoded as entities.
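
One possible explanation, offered only as a guess, is that the raw bytes are
decoded with the wrong charset somewhere between the client and the index.
The Python sketch below is not Solr code; the function names are made up for
illustration, and it assumes the entity in question is the numeric character
reference &#xE9; (U+00E9). It shows that UTF-8 bytes misread as ISO-8859-1
and written back out as UTF-8 grow from two bytes to four, while a character
reference is resolved by the XML parser and so does not depend on how the
bytes were decoded. The four bytes it produces are 0xC3 0x83 0xC2 0xA9, which
is close to, but not identical with, the sequence quoted above, so this may
not be the whole story.

```python
import xml.etree.ElementTree as ET


def double_encode(raw: bytes) -> bytes:
    """Simulate a reader that wrongly treats UTF-8 bytes as ISO-8859-1.

    Hypothetical helper: each input byte becomes one character, and every
    byte >= 0x80 then needs two bytes when written back out as UTF-8.
    """
    misread = raw.decode("iso-8859-1")
    return misread.encode("utf-8")


def entity_bytes(xml_text: str) -> bytes:
    """Parse a one-element XML document and return its text as UTF-8.

    Hypothetical helper: the parser resolves character references such as
    &#xE9; to code points, independent of the transport charset.
    """
    return ET.fromstring(xml_text).text.encode("utf-8")


e_acute = "\u00e9".encode("utf-8")

print(e_acute.hex(" "))                                # c3 a9
print(double_encode(e_acute).hex(" "))                 # c3 83 c2 a9
print(entity_bytes("<field>&#xE9;</field>").hex(" "))  # c3 a9
```

If this is what is happening, the thing to check would be whether the charset
declared when posting the file matches the bytes actually sent.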

