lucene-solr-user mailing list archives

From Jérôme Etévé <jerome.et...@gmail.com>
Subject Re: character encoding issue
Date Wed, 04 Nov 2009 18:40:20 GMT
Hi,

 How do you post your data to solr? If it's by posting XML, then it
should be properly encoded in UTF-8 (which is the XML default),
regardless of what's in the DB (which can be a mystery with MySQL).

At query time, if the XML writer is used, then the response is encoded in
UTF-8. If the json one is used, I think it's the same, because json is
unicode compliant by nature (javascript).

According to what you say, I would bet on a PHP problem. It seems PHP
takes the correct UTF-8 octets from solr and displays them as if they
were latin1-encoded (hence the strange characters). You need to
- either output your pages in UTF-8
- or decode the octets given by solr to a unicode string and let it be
encoded as latin1 for output (with the risk of losing characters that
cannot be encoded in latin1).
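
The mechanism is easy to reproduce outside PHP. Below is a small Python
sketch (not your PHP code, just an illustration with a made-up sample string)
showing how UTF-8 octets read as latin1 turn into strange characters, and
what the second option above does, including its lossy case.

```python
# Sample text; the UTF-8 octets are what solr would send for it.
text = "caf\u00e9"                       # "café"
utf8_octets = text.encode("utf-8")       # b'caf\xc3\xa9'

# The bug: the UTF-8 octets are interpreted as latin1.
# The two octets of "é" become two separate characters.
misread = utf8_octets.decode("latin-1")  # 'cafÃ©'

# Second option: decode as UTF-8 first, then encode as latin1 for output.
decoded = utf8_octets.decode("utf-8")
latin1_octets = decoded.encode("latin-1")  # b'caf\xe9'

# The risk: an en dash (U+2013) has no latin1 representation,
# so it has to be replaced (here by "?") or dropped.
dash_octets = "\u2013".encode("utf-8")
lossy = dash_octets.decode("utf-8").encode("latin-1", errors="replace")
```

The first option (serving the pages as UTF-8) avoids the lossy step
entirely, since the octets from solr can then be passed through unchanged.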

I hope it helps.

J.

2009/11/4 Jonathan Hendler <jonathan.hendler@gmail.com>:
> Hi Peter,
>
> I have the same set of issues and will look for a response here.
>
> Sometimes those other chars can be created at the time of input (for
> example, by extraction from a Microsoft Office doc with a third-party tool).
> But MySQL looking OK in the browser might be because the encoding of MySQL
> was not the same as the original text. Say for example that the collation of
> MySQL is Latin, and the document was UTF-8. When a browser renders, it might
> assume chars are UTF-8, but SOLR might be taking the table type literally in
> the DIH (Latin1 Swedish for example). It could also be that PHP doesn't
> handle UTF-8 well, which depends on your client.
>
> Don't think it has anything to do with Jetty - I use Resin.
>
> Hope that helps,
>
> - Jonathan
>
>
> On Nov 4, 2009, at 8:48 AM, Peter Hedlund wrote:
>
>> I'm having a problem with character encoding.  The data that I'm indexing
>> with SOLR is being pulled from a MySQL database and then the index is being
>> integrated into a PHP application.  When I display the text from the SOLR
>> index it's full of strange characters (–, é, etc...).  However, when I
>> bypass SOLR and access the data from the MySQL table directly and write to
>> the browser I don't see any problems with em-dashes and accented characters.
>>
>> Is this a JETTY issue or a SOLR issue or something else?  (It's not simply
>> an issue of including <meta http-equiv="Content-Type"
>> content="text/html;charset=UTF-8"> either)
>>
>> Thanks for any help.
>>
>> Peter Hedlund
>>
>>
>
>



-- 
Jerome Eteve.
http://www.eteve.net
jerome@eteve.net
