lucene-solr-user mailing list archives

From Chris Hostetter
Subject Re: Update with non UTF-8 characters
Date Wed, 01 Oct 2014 23:59:24 GMT

: I am indexing Solr 4.9.0 using the /update request handler and am getting
: errors from Tika - Illegal IOException from
: org.apache.tika.parser.xml.DcXMLParser@74ce3bea which is caused by
: MalFormedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence. I

FWIW: that error appears to have come from /update/extract .. hard to be 
sure w/o full stack trace from the logs ... but i'll assume that's 
just a copy/paste mistake from the second test you mentioned trying, 
and assume your assessment is correct...

: believe that this is the result of attempting to pass information to Solr
: via CURL as XML in which the data has non UTF characters such as Smart
: Quotes (the irony of that name is amazing). So when I:

...and focus on the example command you mentioned...

: curl -H "Content-Type: text/xml"
: --data-binary "<add><doc><field name=\"id\">123456</field><field
: name=\"observation\">This is some text that was passed from the .NET
: application to Solr for indexing. Users typically write in Word then copy
: and paste into the .NET application UI which then passes everything to Solr
: for indexing. If there are "smart quotes" it crashes, but "regular quotes"
: are fine.</field></doc></add>"

if you tell solr you are sending it XML, then you have to send it valid 
XML.  if you don't specify a charset -- either in the Content-Type, or in 
an XML prolog declaration -- then the XML spec says UTF-8 must be assumed.  
if the bytes in your doc aren't valid UTF-8, then it's not well-formed 
XML, and the parser will reject it.

if you actually know what charset you are sending, then you can specify it 
-- and as long as your JVM implementation understands it, it should work.
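
to make the charset point concrete, here is a minimal python sketch (not 
solr itself, just python's bundled XML parser). it assumes the pasted text 
arrived as windows-1252 bytes, which is only a guess at what the .NET 
application sends; the payload and the `parses` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: Word-style "smart quotes" encoded as windows-1252,
# where the curly quotes are the single bytes 0x93 and 0x94.
text = "If there are \u201csmart quotes\u201d it crashes"
body = (
    b'<add><doc><field name="observation">'
    + text.encode("windows-1252")
    + b"</field></doc></add>"
)


def parses(xml_bytes):
    """Return True if the bytes parse as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


# No charset declared: the parser assumes UTF-8 and rejects byte 0x93.
undeclared_ok = parses(body)

# Same bytes, but the prolog names the charset they are actually in.
declared = b'<?xml version="1.0" encoding="windows-1252"?>' + body
declared_ok = parses(declared)

# Alternative: transcode to UTF-8 before sending, so no declaration is needed.
utf8_ok = parses(body.decode("windows-1252").encode("utf-8"))

print(undeclared_ok, declared_ok, utf8_ok)
```

either fix should apply equally to a curl upload: declare the real charset 
in the prolog or the Content-Type, or convert the bytes to UTF-8 first.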

you can't however just read some raw bytes from somewhere, slap some 
xml-ish lookin strings in front & behind, and hope you have valid xml.

if you use a good XML serialization library in your .Net application to 
generate the messages you send to Solr, then the serialization library 
should help mitigate this problem -- either by specifying the correct 
encoding in the xml prolog it generates for you in its output, or by 
converting the input "strings" to utf-8, or by giving you a good error 
if/when you ask it to serialize characters that can't be serialized in XML 
(there are some, like null bytes and most control characters).
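
as a rough illustration of the serialization-library approach, here is a 
python sketch (a .NET application would use its own XML writer instead). 
the function names and the sample field values are hypothetical. note that 
python's ElementTree does not reject illegal characters on its own, so 
this sketch strips them explicitly; raising an error instead would be an 
equally reasonable choice.

```python
import re
import xml.etree.ElementTree as ET

# Matches anything outside the XML 1.0 "Char" production
# (for example NUL and most other control characters).
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(value):
    """Drop characters that cannot appear in an XML 1.0 document."""
    return _ILLEGAL_XML_CHARS.sub("", value)


def build_add_message(fields):
    """Serialize a dict of field name -> text as a Solr <add> message.

    Returns UTF-8 bytes with an XML declaration; the library handles
    escaping of &, <, > and quotes.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = strip_illegal_xml_chars(value)
    # xml_declaration requires Python 3.8 or later.
    return ET.tostring(add, encoding="utf-8", xml_declaration=True)


message = build_add_message({
    "id": "123456",
    "observation": 'Users paste \u201csmart quotes\u201d & "regular quotes"\x00',
})
```

the resulting `message` bytes can then be sent as the request body with a 
Content-Type of `text/xml; charset=utf-8`.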

