quetz-mod_python-dev mailing list archives

From Nicolas Lehuen <nicolas.leh...@gmail.com>
Subject Re: mod_python.publisher : proposal for a few implementation changes
Date Sat, 30 Apr 2005 12:24:01 GMT
Wow. I'm working on another project right now which involves a C++
core with Python and Java mappings (using SWIG). I've just got
confused and assumed that the Java Native Interface behaviour of
exchanging string data in UTF8 format was also found in Python. Sorry.

So, all this relies on the default platform encoding. How nice. The
reason you don't find sys.setdefaultencoding() is that this
method is deleted from the module after the module is loaded,
presumably to prevent developers from changing the default encoding on
the fly. I remember being mad at Python when I first discovered that (I
was trying to get rid of this dumb 'ascii' default encoding).
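
For what it's worth, the absence is easy to confirm from a running
interpreter. A minimal sketch, assuming a normally started interpreter:
in Python 2 the site module removes the function at startup, and in
Python 3 it no longer exists at all, so the check gives the same answer
in both. The variable names are only illustrative.

```python
import sys

# On a normally started interpreter the setter is not reachable from sys.
setter_missing = not hasattr(sys, "setdefaultencoding")

# The getter is still available; it reports the encoding used for
# implicit conversions ('ascii' on a stock Python 2 install).
default_encoding = sys.getdefaultencoding()

print("setdefaultencoding missing:", setter_missing)
print("default encoding:", default_encoding)
```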

This is one more reason NOT to let the system handle the writing of
unicode strings on the request output stream. The server's default
encoding could be any encoding, and there is no guarantee that this
encoding is good for the content you want to send. My example about
French accented characters still holds. Put simply, if I want to return
u'café' on my computer, I get this:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
position 3: ordinal not in range(128)

It's not very useful to be able to return unicode strings if the only
codepoints that are allowed are those that have a mapping in ASCII...
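
The failure is easy to reproduce outside mod_python. A minimal sketch,
written in Python 3 syntax where the encoding step has to be spelled out
(under Python 2 the same error comes from the implicit str() conversion
with an 'ascii' default). The variable names are only illustrative.

```python
text = "caf\u00e9"  # the same string as u'café'

# Encoding with the ASCII codec fails on the accented character.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    failed_at = exc.start  # index of the first character that cannot be encoded

# An explicitly chosen encoding that covers the character works fine.
utf8_bytes = text.encode("utf-8")

print(ascii_ok, failed_at, utf8_bytes)
```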

So we might as well drop the Unicode support and tell the developer to
handle the encoding himself, OR extract the desired encoding from the
Content-Type header and handle the encoding in the publisher.
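
To make the second option concrete, here is a rough sketch of what
"extract the encoding from the Content-Type header" could look like. It
is not the publisher's actual code: the function names are hypothetical,
the header is passed in as a plain string rather than read from the
request object, and the UTF-8 fallback is only the default suggested
earlier in this thread. It uses the standard email package to parse the
header and is written in Python 3 syntax.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or the
    # fallback value when the header carries no charset parameter.
    return msg.get_content_charset(default)


def encode_body(text, content_type):
    """Encode a unicode string with the charset the developer declared."""
    return text.encode(charset_from_content_type(content_type))


print(encode_body("caf\u00e9", "text/html; charset=ISO-8859-1"))
print(encode_body("caf\u00e9", "text/html"))
```

An unknown charset name would raise LookupError at the encode step, and
a character outside the declared charset would still raise
UnicodeEncodeError, so a real implementation would have to decide how to
report both cases.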

Regards,

Nicolas


On 4/30/05, Graham Dumpleton <grahamd@dscpl.com.au> wrote:
> 
> On 30/04/2005, at 9:40 PM, Nicolas Lehuen wrote:
> 
> > Graham, the encoding used by PyArg_ParseTuple is indeed UTF-8, whereas
> > str(unicode_string) uses the default encoding of the platform Python
> > is running on, which is unpredictable (for example, for years now
> > under win32 it has been ASCII even though there are ways to get the
> > default encoding specific to the current setup; I suspect the
> > situation is not better on other platforms).
> >
> > Thus, if we removed the check for UnicodeType and simply did result =
> > str(object) for unicode string, we would have runtime exceptions,
> > because if the string contains accents, under win32, the default
> > encoder (ascii) will complain that it does not know how to encode
> > them.
> >
> > I'd rather have the developer explicitly choose the encoding he
> > wishes to use, with a default to UTF8, through the content-type
> > header.
> 
> Hmmm, getting confusing. :-(
> 
> The code says:
> 
>      if (encoding == NULL)
>          encoding = PyUnicode_GetDefaultEncoding();
> 
>      /* Shortcuts for common default encodings */
>      if (errors == NULL) {
>          if (strcmp(encoding, "utf-8") == 0)
>              return PyUnicode_AsUTF8String(unicode);
>          else if (strcmp(encoding, "latin-1") == 0)
>              return PyUnicode_AsLatin1String(unicode);
> #if defined(MS_WINDOWS) && defined(HAVE_USABLE_WCHAR_T)
>          else if (strcmp(encoding, "mbcs") == 0)
>              return PyUnicode_AsMBCSString(unicode);
> #endif
>          else if (strcmp(encoding, "ascii") == 0)
>              return PyUnicode_AsASCIIString(unicode);
>      }
> 
>      /* Encode via the codec registry */
>      v = PyCodec_Encode(unicode, encoding, errors);
> 
> Thus the default doesn't seem to be UTF-8 but is whatever default
> encoding str() would use.
> 
> Maybe mod_python should have an Apache configuration file option which
> allows you to set the default encoding. Internally it could call:
> 
>    PyUnicode_SetDefaultEncoding()
> 
> The option could only be set outside of any <Directory> or
> other directives, i.e., at the same level as PythonImport. If the
> option is not set, mod_python could forcibly set it to something which
> makes more sense in a web environment and would cause fewer problems.
> For example, it could set it to "UTF-8" if that works better.
> 
> The only thing I am not sure about is in which version of Python this
> function was introduced. I am a bit confused that my Python 2.3 on
> Mac OS X doesn't have sys.setdefaultencoding(), yet it is present in
> the Python 2.3.4 source code I have. I presume that the underlying C
> function would still be there though.
> 
> Graham
> 
>
