lucene-java-user mailing list archives

From Paul Libbrecht <>
Subject Re: UTF-8 indexing and searching
Date Fri, 01 Jul 2005 20:57:59 GMT
Careful: in the HTTP world there's an ambiguity: 
application/x-www-form-urlencoded does not specify the character 
encoding that the bytes represented by the %-escaped sequences are 
written in.
That's fixed by the very recent URI spec, where absence means UTF-8...
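
To make the ambiguity concrete, here is a minimal sketch using the JDK's java.net.URLDecoder. The class name, the helper method and the sample value are made up for illustration; the point is only that the same %-escaped bytes yield different strings depending on the charset the receiver assumes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class, not part of Lucene or Tomcat.
final class PercentDecodingAmbiguity {

    // Decodes a %-escaped form value, assuming the given charset for the escaped bytes.
    static String decode(String escaped, String charsetName) {
        try {
            return URLDecoder.decode(escaped, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // "%C3%A9" is the UTF-8 byte sequence for e-acute (U+00E9).
        String escaped = "caf%C3%A9";

        // Read as UTF-8, the two bytes form one character: 4 chars in total.
        System.out.println(decode(escaped, "UTF-8").length());

        // Read as ISO-8859-1, each byte becomes its own character
        // (U+00C3 followed by U+00A9): 5 chars in total.
        System.out.println(decode(escaped, "ISO-8859-1").length());
    }
}
```

Nothing in the escaped value itself says which of the two readings the sender intended.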

My experience was that Tomcat simply converted each of these bytes into 
the low byte of a 16-bit Unicode character, therefore working with 
We succeeded in receiving forms from UTF-8-encoded pages by wrapping an 
InputStreamReader in UTF-8 around an InputStream that reads the 
bytes of the string of request.getParam...
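
A rough sketch of that workaround, under stated assumptions: the class and method names are hypothetical, and the sketch assumes the container decoded every byte as exactly one character (which is what ISO-8859-1 decoding does). It is an illustration, not the exact code we used.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Lucene or Tomcat.
final class FormParameterRecoder {

    // Takes a string whose chars each carry one original byte (bytes that
    // were decoded as ISO-8859-1) and reads those bytes again as UTF-8.
    static String reinterpretAsUtf8(String misdecoded) {
        // ISO-8859-1 maps U+0000..U+00FF back to the bytes 0x00..0xFF unchanged.
        // Assumption: the string holds no char above U+00FF; any such char
        // would be replaced by '?' here.
        byte[] rawBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);

        InputStream in = new ByteArrayInputStream(rawBytes);
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

The parameter string obtained from the request would be passed through reinterpretAsUtf8 before it reaches the analyzer. For a value already held in memory, new String(rawBytes, StandardCharsets.UTF_8) gives the same result without the stream plumbing. This only makes sense when the page really submitted UTF-8; applied to a correctly decoded string it would corrupt it.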

Hope that helps.


On 1 Jul 2005, at 22:41, <> wrote:

> Did you check that the request string you get at the analyzer
> level is correctly encoded as UTF-8?
> We had the same problem with French accented characters, also
> encoded as UTF-8, and transmitted by Tomcat as ISO-8859-1. It was
> just a test, so we didn't investigate much, but we
> re-encoded to URL/ISO-8859-1 and re-decoded from the URL as correct
> UTF-8, and it worked.
> Don't know if it may help you ...
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
