lucene-solr-user mailing list archives

From Christian Wittern
Subject Re: invalid XML character
Date Sun, 02 Mar 2008 04:26:48 GMT
Yonik Seeley wrote:
> On Sat, Mar 1, 2008 at 6:47 PM, Leonardo Santagada wrote:
>>  Can't he put this code on the server before the xml parsing somehow? I
>>  would do like you said and do it on the client, but just out of
>>  curiosity is this really impossible?
> We'd have to implement our own xml parser (or a subset of one) for that.
I am not sure this is such a good idea.  After all, XML does not allow 
these characters, so a parser that accepted them would not be a 
standards-compliant XML parser, and you would need to more or less 
re-invent the whole tool-chain for your 
slightly-modified-but-not-quite-XML format. 

A better strategy, I think, would be to put the responsibility on the 
client to send correct XML if they say they are sending XML.  If 
necessary, a different escaping mechanism, like the \u<codepoint> used 
in many programming languages, could be used for the XML transport layer.

> A simple search+replace of &#xx; could do the wrong thing I think
> (might be an actual literal in a CDATA block for example).  
This would also not get you past the XML parser, since to the parser 
&#6; looks exactly the same as the character expressed with its binary value.
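
This is easy to check with any conforming parser. The small sketch below 
uses Python's standard-library ElementTree (expat-based) only as a 
stand-in for whatever parser the server uses; the helper name parses is 
made up for the example. It also shows the CDATA case mentioned above, 
where the same six characters are ordinary text:

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# U+0006 as a raw character and as a numeric character reference.
literal_ok = parses("<field>\x06</field>")
reference_ok = parses("<field>&#6;</field>")

# A reference to an allowed control character (tab) is fine.
tab_ok = parses("<field>&#9;</field>")

# Inside CDATA the text "&#6;" is not a reference at all, just characters,
# which is why a blind search+replace on the raw input can go wrong.
cdata_text = ET.fromstring("<field><![CDATA[&#6;]]></field>").text
```

Both forms of U+0006 are rejected as not well-formed, so rewriting one 
into the other on the server would not help.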

> The
> easiest place to fix it is before the field values are serialized into
> XML.


All the best,



 Christian Wittern 
 Institute for Research in Humanities, Kyoto University
 47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN
