quetz-mod_python-dev mailing list archives

From Graham Dumpleton <grah...@dscpl.com.au>
Subject Re: mod_python.publisher : proposal for a few implementation changes
Date Sat, 30 Apr 2005 11:42:19 GMT
You must be getting sick of me picking apart everything you put up. :-)

The problem I see with this is that it wouldn't be applied in all cases
where Unicode strings are returned. Imagine:

   class _Object:
     def __str__(self):
       return u'123'

   object = _Object()

The "object" variable isn't a Unicode string and if "object" is 
accessed,
then str() gets applied to it and __str__() will return a Unicode 
string.
This therefore bypasses your attempt to convert it using the appropriate
encoding.

Interesting that in this case the Unicode string also gets delivered 
direct
to req.write() as well.
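
To make that concrete, here is a rough sketch (Python 2; the non-ASCII
value is only illustrative) of how such an object slips past an
isinstance() check on UnicodeType while str() still goes through the
default codec:

   from types import UnicodeType

   class _Object:
     def __str__(self):
       return u'caf\xe9'

   object = _Object()

   # False, so an encode(charset) branch keyed on UnicodeType is skipped
   print isinstance(object, UnicodeType)

   # str() calls __str__() and then encodes the Unicode result with the
   # default codec (normally ASCII), so non-ASCII data fails here instead
   # of being converted with the charset the publisher chose
   try:
     print str(object)
   except UnicodeEncodeError:
     print 'UnicodeEncodeError'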

I would suggest that the whole encoding issue be left up to the developer
to handle rather than trying to be smart about it and making it automatic.
The developer is going to know what they want, whereas we would be making
assumptions and could get it wrong.
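
For what it's worth, a rough sketch of what that might look like in a
published function (the page content and the choice of iso-8859-1 are only
placeholders):

   def index(req):
     # the developer states the charset explicitly ...
     req.content_type = 'text/html; charset=iso-8859-1'
     page = u'<html><body>caf\xe9</body></html>'
     # ... and returns bytes already encoded to match it
     return page.encode('iso-8859-1')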

Graham

On 30/04/2005, at 9:28 PM, Nicolas Lehuen wrote:

> I think in this case the default conversion used is UTF8. Ideally, a
> developer returning Unicode strings from functions should have a way
> to decide in what encoding (UTF-8, iso-latin-1, etc.) the string
> should be returned to the client.
>
> One possible way to do that would be to parse the content-type header,
> i.e. if the developer sets the content type header to "text/html;
> charset=iso-8859-1", then we know the developer expects the result to
> be encoded in iso-8859-1, so we can do result =
> object.encode('iso-8859-1').
>
> Here is some tentative code for this:
>
> re_charset = re.compile(r"charset\s*=\s*([^\s;]+)")
>
> def publish_object(req, object):
>     if callable(object):
>         req.form = util.FieldStorage(req, keep_blank_values=1)
>         return publish_object(req, util.apply_fs_data(object, req.form, req=req))
>     elif hasattr(object,'__iter__'):
>         result = False
>         for item in object:
>             result |= publish_object(req,item)
>         return result
>     else:
>         if object is None:
>             return False
>         elif isinstance(object,UnicodeType):
>             # We try to detect the character encoding
>             # from the Content-Type header
>             if req._content_type_set:
>                 charset = re_charset.search(req.content_type)
>                 if charset:
>                     charset = charset.group(1)
>                 else:
>                     charset = 'UTF8'
>                     req.content_type += '; charset=UTF8'
>             else:
>                 charset = 'UTF8'
>
>             result = object.encode(charset)
>         else:
>             result = str(object)
>
>     [...]
>
> Regards,
> Nicolas
>
>
> On 4/30/05, Graham Dumpleton <grahamd@dscpl.com.au> wrote:
>>
>> On 30/04/2005, at 6:37 PM, Nicolas Lehuen wrote:
>>>         elif isinstance(object,UnicodeType):
>>>             # TODO : this skips all encoding issues, which is VERY BAD
>>>             # I don't even understand how the req.write below can work!
>>>             result = object
>>>         else:
>>>             result = str(object)
>>
>> What do you see as the issue that required an explicit check for
>> UnicodeType and the avoidance of converting it with str()?
>>
>> As the code stands above, req.write() will be called with the Unicode
>> object. This will work provided that the Unicode string can be converted
>> into a normal string using the default encoding, i.e., in the underlying
>> C code PyArg_ParseTuple will use "s", meaning:
>>
>> "s" (string or Unicode object) [char *]
>>    Convert a Python string or Unicode object to a C pointer to a
>> character
>>    string. You must not provide storage for the string itself; a 
>> pointer
>>    to an existing string is stored into the character pointer variable
>>    whose address you pass. The C string is null-terminated. The Python
>>    string must not contain embedded null bytes; if it does, a 
>> TypeError
>>    exception is raised. Unicode objects are converted to C strings 
>> using
>>    the default encoding. If this conversion fails, an UnicodeError is
>> raised.
>>
>> I think, though, that applying str() in the Python code to the Unicode
>> string probably yields the same result, i.e., str(u'123') results in the
>> encode() method of the Unicode string object being called.
>>
>> S.encode([encoding[,errors]]) -> string
>>
>> Return an encoded string version of S. Default encoding is the current
>> default string encoding. errors may be given to set a different error
>> handling scheme. Default is 'strict' meaning that encoding errors raise
>> a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
>> 'xmlcharrefreplace' as well as any other name registered with
>> codecs.register_error that can handle UnicodeEncodeErrors.
>>
>> In other words, I don't believe there is any difference between
>> converting it using str() before the call to req.write() and passing the
>> Unicode string directly to req.write(). Thus, the explicit check for
>> UnicodeType is probably not required.
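>>
>> A quick way to convince yourself of that (a minimal sketch; it assumes
>> the default encoding hasn't been changed from ASCII):
>>
>>    import sys
>>    assert str(u'123') == u'123'.encode(sys.getdefaultencoding())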
>>
>> Graham
>>
>>

