lucene-solr-user mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: International Charsets in embedded XML
Date Tue, 13 Jun 2006 17:08:44 GMT
>Klaas-2 wrote:
>>
>>  Are you sending Content-Type headers with appropriate charset
>>  indicated?  Is your XML fully escaped in your update message?
>>
>
>...no, actually I simply make a
>
>			URLConnection conn = url.openConnection();
>			conn.setRequestProperty("ContentType", "text/xml");
>			conn.setDoOutput(true);
>			wr = new OutputStreamWriter(conn.getOutputStream());
>			wr.write(data);
>			wr.flush();
>
>to post the del/add XML, and my XML is embedded in a CDATA section without
>further escaping... do I have to do something else?
>
>I'm getting data from a MySQL db and I found some problems when
>retrieving data from there.
>
>I've made some steps forward by connecting to the db with
>"characterEncoding=utf8" in the JDBC URL, and then converting with:
>
>new String(mysqlXMLField.getBytes("latin1"));

If you use "characterEncoding=utf8", then I think you'll get back a 
stream of UTF-8 bytes from the DB.

I don't know what mysqlXMLField's type is (from above), but you 
should start with the array of bytes returned from the JDBC call, and 
then create the string from this array using "UTF-8" as the encoding 
name. Or just use those bytes directly when writing out the XML.
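As a rough sketch of that decoding step (not tested against your schema): the class and method names below are made up for illustration, and it assumes you can get the raw column bytes, for example via ResultSet.getBytes(). The second method is only there to show what goes wrong when the same bytes are decoded as Latin-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of JDBC or Solr.
final class Utf8Decode {

    // Decode raw column bytes as UTF-8, naming the charset explicitly
    // instead of relying on the platform default.
    static String fromUtf8Bytes(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // What happens if the same UTF-8 bytes are decoded as Latin-1:
    // each byte becomes its own character, so "é" turns into "Ã©".
    static String misdecodedAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }
}
```

Note that new String(bytes) with no charset argument uses the platform default encoding, so its result can differ from one machine to another.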

>But I'm really not into charsets and encodings...

The best thing to do is:

1. Make sure the XML you send to Solr starts with this line:

<?xml version="1.0" encoding="utf-8"?>

2. Make sure you've converted all of the text in the XML fields to 
UTF-8.

Then don't wrap those fields in CDATA; escape &, < and > as entities 
instead.
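
Put together, the sending side could look something like the sketch below. Treat it as an illustration under assumptions rather than a tested client: the class name, the method names and the "id"/"body" field names are hypothetical, and the update URL is whatever your Solr instance uses. The parts that matter are the XML declaration, the entity escaping in place of CDATA, the explicit UTF-8 byte conversion, and the "Content-Type" header (with a hyphen) carrying the charset.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; names are not from Solr.
final class SolrUtf8Post {

    // Escape the characters that are special in XML text content,
    // so the field values do not need a CDATA wrapper.
    static String escapeXml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Build an add message that starts with the UTF-8 XML declaration.
    // The field names "id" and "body" are assumptions for this example.
    static String buildAddXml(String id, String body) {
        return "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
             + "<add><doc>"
             + "<field name=\"id\">" + escapeXml(id) + "</field>"
             + "<field name=\"body\">" + escapeXml(body) + "</field>"
             + "</doc></add>";
    }

    // Send the XML as UTF-8 bytes and declare the charset in the header.
    // Returns the HTTP status code reported by the server.
    static int post(URL updateUrl, String xml) throws IOException {
        byte[] payload = xml.getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn = (HttpURLConnection) updateUrl.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload);
        }
        return conn.getResponseCode();
    }
}
```

Writing bytes yourself avoids the problem with new OutputStreamWriter(stream) without a charset argument, which encodes using the platform default.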

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
