lucene-solr-user mailing list archives

From Fabio Confalonieri <fa...@zero.it>
Subject International Charsets in embedded XML
Date Tue, 13 Jun 2006 13:06:48 GMT

(Sorry, the last message was posted by mistake.)

Here I am again with charset encoding problems:

I need to store XML in a document field. I declare the field as a string and
wrap its value in CDATA when I post the add XML.
The problem is that the XML contains some international characters, say ì or à,
and also € (I don't know whether you can read these).
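
A minimal sketch of how such an add message might be built and turned into UTF-8 bytes before posting. The class and method names (AddXmlBuilder, cdata, buildAdd, toUtf8) are hypothetical and not part of Solr; the field names are the ones that appear in the response further down. The non-ASCII characters are written as Unicode escapes so that the source file encoding cannot interfere.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr: builds an <add> message whose
// facetXML field carries raw XML inside a CDATA section.
public class AddXmlBuilder {

    // Wrap raw text in CDATA; a literal "]]>" inside the text is split
    // across two sections so it cannot close the section early.
    public static String cdata(String raw) {
        return "<![CDATA[" + raw.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    // Assumes categoryId contains no XML special characters; it is not escaped here.
    public static String buildAdd(String categoryId, String facetXml) {
        return "<add><doc>"
             + "<field name=\"categoryid\">" + categoryId + "</field>"
             + "<field name=\"facetXML\">" + cdata(facetXml) + "</field>"
             + "</doc></add>";
    }

    // Encode explicitly as UTF-8 rather than relying on the platform default charset.
    public static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u00EC is the accented i, \u20AC is the euro sign.
        String facet = "<item value=\"autocaravanmansardato\">Autocaravan \u00ECMansardato \u20AC</item>";
        byte[] body = toUtf8(buildAdd("/relazioni/", facet));
        System.out.println(body.length + " bytes to post");
    }
}
```

Whatever client posts these bytes would also need to declare the same charset in the request (for example in the Content-Type header); how Solr treats a request without one is not something this message establishes.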

When I get the XML field back from Solr, strange things happen:

- First: € is converted to ? (I can see this in the index when looking at it with Luke).

- If there is an ì (accented i), I get malformed XML back, in both Firefox
and IE:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <responseHeader><status>0</status><QTime>0</QTime></responseHeader>
  <result numFound="1" start="0">
    <doc>
      <str name="categoryid">/relazioni/</str>
      <str name="facetXML">&lt;?xml version="1.0" encoding="UTF-8"?>&lt;xml>
	&lt;filter field="typecamper_s">
	&lt;item value="autocaravanmansardato">Autocaravan ìMansardato</item>
							                           ^ The problem begins HERE: from this point on,
"<" is no longer escaped

	<item value="semintegrale">Semintegrale</item>
	</filter>
	</xml>
	
	HERE the output continues, as it should have been escaped after the
problem above:
	
	&lt;/item>&lt;item value="semintegrale">Semintegrale&lt;/item>&lt;/filter>
	&lt;/xml>
      </str>
      ...
    </doc>
  </result>
</response>

But if I get the same document in my request handler (as a Document
structure), I have no problem parsing the XML and I get the correct
characters.
I have traced XML.escape and the problem is not there, so it is somewhere
between XMLWriter and Jetty (I have tried the latest version, 5.1.11).

- If I put international characters in a normal string field, I see that Solr
stores the UTF-8 encoded (I think) characters, in a string field just as in a
text field.
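
The first symptom (€ turning into ?) is what one would see if the text passed through a single-byte charset somewhere along the way. That cause is only an assumption and is not confirmed anywhere in this message. The sketch below, with a hypothetical class name, shows the effect: € has no code point in ISO-8859-1, so Java substitutes ?, while ì does exist there and survives.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: encodes a string to bytes and decodes it again
// with the same charset, to show which characters survive the round trip.
public class CharsetMismatchDemo {

    // Round trip through ISO-8859-1. Characters the charset cannot
    // represent are replaced by '?' when encoding.
    public static String viaLatin1(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Round trip through UTF-8, which can represent every Unicode character.
    public static String viaUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String euro = "\u20AC";   // euro sign
        String iGrave = "\u00EC"; // accented i

        System.out.println(viaLatin1(euro).equals("?"));      // true: euro is lost
        System.out.println(viaLatin1(iGrave).equals(iGrave)); // true: accented i survives
        System.out.println(viaUtf8(euro).equals(euro));       // true: nothing is lost
    }
}
```

If something like this is happening, the place to look would be wherever bytes are turned into characters (or back) without an explicit charset.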

The question is: apart from the malformed XML issue, what is the best way
to deal with international character sets?

Thank you

Fabio
--
View this message in context: http://www.nabble.com/International-Charsets-in-embedded-XML-t1780147.html#a4846383
Sent from the Solr - User forum at Nabble.com.

