lucene-solr-user mailing list archives

From "Yonik Seeley" <>
Subject Re: Cyrillic characters
Date Tue, 18 Jul 2006 23:16:47 GMT
OK, let's split up the indexing side from the query side for a moment
and assume that you are indexing correctly (setting the content-type
correctly, etc).

I just added a new value to the multi-valued features field in the
solr.xml example document:
  "Good unicode support: héllo (hello with an accent over the e)"
or in the XML:
  <field name="features">Good unicode support: h&#xE9;llo (hello with
an accent over the e)</field>

I used a numeric entity because doesn't specify any
content-type (ascii or latin1 may be assumed).  But as I said, let's
assume things are indexed correctly for now.
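
As a small illustration of why the numeric entity is safe: the XML stays
pure ASCII on the wire, yet any XML parser resolves it to the accented
character. This is only a sketch using Python's standard library parser,
not Solr's own update handler, and the shortened field text is made up
for the example.

```python
import xml.etree.ElementTree as ET

# The document itself contains only ASCII bytes, so it survives even if
# the receiver assumes ASCII or Latin-1 for the request body.
xml_doc = '<field name="features">Good unicode support: h&#xE9;llo</field>'

# The parser expands &#xE9; to the code point U+00E9.
field = ET.fromstring(xml_doc)
text = field.text
print(text)
```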

The URI standard (RFC 3986, section 2.5) says the following:
'''When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be
percent-encoded. For example, the character A would be represented as
"A", the character LATIN CAPITAL LETTER A WITH GRAVE would be
represented as "%C3%80", and the character KATAKANA LETTER A would be
represented as "%E3%82%A2".'''

So, the Unicode code point for the e with an acute accent is \u00E9.
In UTF-8 encoding it's a two-byte sequence: 0xc3,0xa9
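
The encoding steps above can be checked with a few lines of Python. This
is just a sketch using the standard library's urllib.parse.quote, which
encodes to UTF-8 by default and then percent-encodes the resulting
octets. The variable names are illustrative.

```python
from urllib.parse import quote

word = "h\u00e9llo"  # "hello" with an acute accent on the e

# Step 1: encode the characters as UTF-8 octets.
utf8_bytes = word.encode("utf-8")

# Step 2: percent-encode every octet outside the unreserved set.
encoded = quote(word)

# The three examples quoted from the URI standard.
plain_a = quote("A")
a_grave = quote("\u00c0")   # LATIN CAPITAL LETTER A WITH GRAVE
katakana_a = quote("\u30a2")  # KATAKANA LETTER A

print(utf8_bytes.hex(), encoded, plain_a, a_grave, katakana_a)
```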

In both Firefox and IE, the following URI works fine to find the document:

If I try pasting héllo from notepad directly into the URL, IE works
fine, but Firefox substitutes the accented e with %E9, which is

I haven't tried more complicated examples yet, and I haven't tried
wget, etc, but things look like they are working as expected so far
(with the exception of a Firefox bug).
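
To show why the %E9 form mentioned above is a problem, here is a sketch
of what happens when that byte is decoded. %E9 is the Latin-1 byte for
the accented e, not its UTF-8 encoding. The assumption here is a
receiver that decodes percent-encoded octets as UTF-8; how any
particular server or servlet container actually decodes them is not
something this example establishes.

```python
from urllib.parse import quote, unquote

# Percent-encoding the Latin-1 byte instead of the UTF-8 octets.
latin1_form = quote("h\u00e9llo", encoding="latin-1")

# Decoded as UTF-8 (unquote's default), the lone 0xE9 byte is invalid
# and is replaced with U+FFFD, so the query no longer matches.
as_utf8 = unquote(latin1_form)

# Decoded as Latin-1, the original word comes back.
as_latin1 = unquote(latin1_form, encoding="latin-1")

print(latin1_form, repr(as_utf8), as_latin1)
```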

