lucene-solr-user mailing list archives

From "Teague James" <>
Subject Update with non UTF-8 characters
Date Wed, 01 Oct 2014 19:15:36 GMT

I am indexing into Solr 4.9.0 using the /update request handler and am getting
errors from Tika: "Illegal IOException from
org.apache.tika.parser.xml.DcXMLParser@74ce3bea", which is caused by
"MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence". I
believe this is the result of passing information to Solr via curl as XML in
which the data contains non-UTF-8 characters such as smart quotes (the irony
of that name is amazing). So when I run:

curl -H "Content-Type: text/xml"
--data-binary "<add><doc><field name=\"id\">123456</field><field
name=\"observation\">This is some text that was passed from the .NET
application to Solr for indexing. Users typically write in Word then copy
and paste into the .NET application UI which then passes everything to Solr
for indexing. If there are “smart quotes” it crashes, but \"regular quotes\"
are fine.</field></doc></add>"

I also tried /update/extract, but since this isn't an actual document it
still doesn't work. 

Is there a way to cope with these non-UTF-8 characters using the /update
method I'm currently using, by altering the content type or something? Maybe
by altering the request handler? Or is it by virtue of text/xml that I cannot
use these characters, so that I need to write logic into the application to
convert them or strip them out?
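
For illustration, the "logic in the application" option could look like the
minimal sketch below. It is written in Python rather than .NET purely for
brevity, and it rests on an assumption the message does not confirm: that the
pasted text reaches the application as Windows-1252 bytes, where smart quotes
are the single bytes 0x93 and 0x94. A lone byte like that is not valid UTF-8,
which would be consistent with the error above. The function name
to_solr_add_xml and its parameters are hypothetical.

```python
from xml.sax.saxutils import escape


def to_solr_add_xml(doc_id, observation_bytes, source_encoding="cp1252"):
    """Build a UTF-8 encoded <add> document from raw field bytes.

    Assumption: observation_bytes is Windows-1252 (cp1252) text, as is
    typical for content pasted from Word on Windows.
    """
    # Decode from the assumed source encoding into Unicode text.
    text = observation_bytes.decode(source_encoding)
    # Escape &, < and > so the field value is well-formed XML.
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<add><doc>"
        '<field name="id">' + escape(doc_id) + "</field>"
        '<field name="observation">' + escape(text) + "</field>"
        "</doc></add>"
    )
    # Re-encode as UTF-8, which is what an XML parser expects by default.
    return xml.encode("utf-8")


# 0x93 and 0x94 are the cp1252 bytes for left and right smart quotes.
raw = b"If there are \x93smart quotes\x94 it crashes"
payload = to_solr_add_xml("123456", raw)
```

The resulting payload keeps the smart quotes, now as valid UTF-8, instead of
stripping them. It would then be posted with an explicit charset, for example
a Content-Type of "text/xml; charset=utf-8". If the source encoding is
actually something other than Windows-1252, the source_encoding argument
would need to change accordingly.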

Any thoughts or advice would be appreciated! Thanks!

