lucene-solr-user mailing list archives

From Mike Klaas <mike.kl...@gmail.com>
Subject Re: solr.py problems with german "Umlaute"
Date Thu, 06 Sep 2007 20:29:44 GMT

On 6-Sep-07, at 12:13 PM, Yonik Seeley wrote:

> On 9/6/07, Brian Carmalt <bca@contact.de> wrote:
>> Try it with title.encode('utf-8').
> >> As in: kw = {'id':'12','title':title.encode('utf-8'),
> >>              'system':'plone','url':'http://www.google.de'}
>
> It seems like the client library should be responsible for encoding,
> not the user.
> So try changing
> title="Übersicht"
>   into a unicode string via
> title=u"Übersicht"
>
> And that should hopefully get your test program working.
> If it doesn't, it's probably a solr.py bug and should be fixed there.

It may or may not, depending on the vagaries of the encoding in his  
text editor.

What python gets when you enter u'é' is the byte sequence  
corresponding to the encoding of your editor.  For instance, my  
terminal is set to utf-8 and when I type in é it is equivalent to  
entering the bytes C3 A9:

In [5]: 'é'
Out[5]: '\xc3\xa9'

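Prepending u does not work, because you are telling python that you  
want these two bytes as two separate unicode characters.  Note that in  
a script this could be fixed by declaring a source encoding that  
matches your editor (PEP 263: a "# -*- coding: utf-8 -*-" line at the  
top of the file).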

In [1]: u'é'
Out[1]: u'\xc3\xa9'
In [11]: print u'é'
é

The proper thing to do is to interpret the byte sequence given the  
proper encoding:

In [3]: 'é'.decode('utf-8')
Out[3]: u'\xe9'
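
To make the difference concrete, here is a minimal sketch that starts from the raw bytes C3 A9 and decodes them two ways. It assumes the wrong interpretation behaves like latin-1 (one character per byte), which matches the u'\xc3\xa9' result above; the variable names are illustrative only. It is written to run on both Python 2 and Python 3.

```python
# The two bytes a UTF-8 terminal or editor produces for e-acute.
raw = b'\xc3\xa9'

# Wrong codec (assumed latin-1 here): every byte becomes its own character.
wrong = raw.decode('latin-1')

# Right codec: the two bytes together are one code point, U+00E9.
right = raw.decode('utf-8')

print(repr(wrong), len(wrong))
print(repr(right), len(right))
```

The first result has length 2 and the second has length 1, which is a quick way to check which case you are in.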

or enter the desired unicode character directly:

 >>> u'\u00e9'
u'\xe9'
 >>> print u'\u00e9'
é
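
As a rough illustration of why the escape form is safe: it contains only ASCII, so the result does not depend on the editor or terminal encoding. The sketch below also shows the explicit encode step suggested earlier in the thread; the title value is just the example word from the quoted message written with an escape, and the names are illustrative. It should run on Python 2 and Python 3.

```python
# Three ASCII-only spellings of the same character, U+00E9.
e_acute = u'\u00e9'
same = (e_acute == u'\xe9' == u'\N{LATIN SMALL LETTER E WITH ACUTE}')

# Encoding the unicode string yields the UTF-8 bytes C3 A9.
encoded = e_acute.encode('utf-8')

# The example title, with U+00DC (capital U with diaeresis) as an escape.
title = u'\u00dcbersicht'
title_bytes = title.encode('utf-8')

print(same, repr(encoded), repr(title_bytes))
```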

This is less complicated in the usual case of reading data from a  
file, because the encoding should be known (terminal encoding issues  
are much trickier).  Use codecs.open() to get a unicode-output text  
stream.
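
For the file case, something along these lines should work. This is only a sketch: the temporary file stands in for your real data file, and it assumes the file is known to be UTF-8.

```python
import codecs
import os
import tempfile

# Stand-in data file: the UTF-8 bytes for the example title plus a newline.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(b'\xc3\x9cbersicht\n')

# codecs.open() decodes while reading, so read() returns unicode text
# rather than raw bytes.
with codecs.open(path, 'r', encoding='utf-8') as f:
    title = f.read().strip()

os.remove(path)

print(repr(title))
```

The resulting unicode string can then be handed to the client library, or encoded explicitly with title.encode('utf-8') if the library expects bytes.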

-Mike 