perl-asp mailing list archives

From: Warren Young
Subject: UTF-8 HOWTO
Date: Thu, 23 Feb 2006 17:27:38 GMT

I finally got around to converting our Apache::ASP application so that 
it uses UTF-8 throughout, instead of Latin-1.  I learned a few things 
that aren't discussed in the archives, so I'm setting them down here for 
others to find.

1. Use the newest Perl you can.  5.8.0 is adequate, but has known bugs
in its Unicode handling.  Under 5.8.0, one screen of our program shows
doubly UTF-8-encoded data while the other screens are correct.  Under
5.8.5, the same program shows correct data on every screen.  While it's
theoretically possible to get Perl 5.6.x to cope with UTF-8 data, I
don't recommend messing with it.  When I first tried UTF-8 a few years
ago on 5.6, I had many problems with Perl incorrectly smashing my data
back to Latin-1.
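
For anyone who hasn't seen a "double UTF-8 conversion" before, here is a
minimal sketch of what it does to the bytes.  It is an illustration
only, not code from our application; the hex_bytes() helper is made up
for the example, and it assumes the core Encode module that ships with
Perl 5.8.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $chars = "\x{e9}";                      # U+00E9, "e" with acute accent
my $once  = encode('UTF-8', $chars);       # correct UTF-8 octets

# The mistake: the octets are taken for Latin-1 text and encoded again.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

print 'encoded once:  ', hex_bytes($once),  "\n";   # C3 A9
print 'encoded twice: ', hex_bytes($twice), "\n";   # C3 83 C2 A9
```

In a browser, the doubly encoded form typically shows up as two junk
characters where one accented character should be.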

2. Also use the newest mod_perl you can.  There are known Unicode bugs 
in mod_perl 1.99_09 and older.

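3. You must say "use utf8;" at the top of each ASP file.  This pragma
tells Perl that the source text itself, string literals included, is
UTF-8.  Its effect is limited to the file it appears in, so if you use
$Response->Include(), each included file also has to say "use utf8;".
The same goes for any Perl modules you use, if you will be passing UTF-8
strings through them.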
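
A rough way to see what "use utf8;" changes is to compare how Perl
counts a non-ASCII literal with and without it.  This is a standalone
sketch, not an ASP page, and it assumes the file is saved as UTF-8 with
the accented letter stored as the single code point U+00E9.

```perl
use strict;
use warnings;

my ($with, $without);

{
    use utf8;                  # literals in this block are decoded as UTF-8
    $with = length("é");       # counted as one character
}

{
    no utf8;                   # literals in this block stay raw bytes
    $without = length("é");    # counted as two bytes (C3 A9)
}

print "with use utf8:    $with\n";
print "without use utf8: $without\n";
```

The pragma is lexically scoped, which is why each block above behaves
differently even though both sit in the same file.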

4. mod_perl doesn't set the LANG environment variable unless you ask it 
to.  Perls 5.8 and newer use the LANG environment variable (among other 
things) to decide whether to use UTF-8 by default or not.  I didn't find 
it to be necessary to ask mod_perl to set this variable in my program, 
but it can't hurt to do it.  If nothing else, it's one less thing you 
have to blame if your pages aren't showing the right data.  In your 
httpd.conf, right after "PerlModule Apache::ASP", say "PerlPassEnv 
LANG".  This will pass your system's default value for LANG through to 
the mod_perl instances, and thus to Apache::ASP.
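
To check whether the variable actually reaches your code, something
like the following can be dropped into a script temporarily.  It is a
sketch only: locale_report() is a made-up helper, and LC_ALL and
LC_CTYPE are included as my guess at the "other things" worth looking
at.

```perl
use strict;
use warnings;

# Hypothetical helper: describe the locale variables in an environment hash.
sub locale_report {
    my ($env) = @_;
    return join ', ',
        map { "$_=" . (defined $env->{$_} ? $env->{$_} : '(unset)') }
        qw(LANG LC_ALL LC_CTYPE);
}

# STDERR typically ends up in the Apache error log under mod_perl.
print STDERR locale_report(\%ENV), "\n";
```

If LANG shows up as "(unset)" after adding PerlPassEnv, it is probably
not set in the environment that starts Apache either.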

5. Ensure that your data source is passing UTF-8 data correctly.  In our 
program, the data comes in via an XML path, so we needed to inform the 
XML parser that the data is UTF-8.  Otherwise, the XML parser assumes 
it's Latin-1, and you get a double UTF-8 conversion.
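
How you tell the parser depends on which XML module you use, so I
won't guess at that here.  The general idea, stating the input encoding
once at the point where bytes enter the program, can be sketched with
nothing but core Perl.  The sample data and the in-memory filehandle
are stand-ins for a real data source.

```perl
use strict;
use warnings;

# "naive" with a diaeresis on the i, as raw UTF-8 octets (6 bytes + newline).
my $bytes = "na\xC3\xAFve\n";

# Declare the encoding on the input handle so Perl decodes exactly once.
open my $fh, '<:encoding(UTF-8)', \$bytes or die "open: $!";
my $line = <$fh>;
close $fh;
chomp $line;

print 'characters read: ', length($line), "\n";   # 5, not 6
```

Once the data is decoded at the boundary, nothing downstream should
decode or re-encode it again until it is written out.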

6. Finally, you need to tell the browser that the data is UTF-8.  This
is done with the charset parameter of the Content-Type HTTP header,
which you can set in a number of ways.  I like to do it with the
equivalent <meta> tag at the top of each file that will contain UTF-8
data:

     <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Alternatively, if all documents on your server should be treated as
UTF-8, Apache's AddDefaultCharset configuration directive can declare
UTF-8 as the charset for text/html and text/plain responses that don't
already specify one.
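
The server-wide alternative could look like this in httpd.conf,
assuming the directive in question is Apache's AddDefaultCharset.  I
haven't needed it myself, so treat it as a starting point rather than a
tested configuration.

```apache
# Add "charset=utf-8" to text/html and text/plain responses
# that do not already declare a charset.
AddDefaultCharset utf-8
```

Note that this only labels the output; it does not convert any data, so
the pages still have to be UTF-8 to begin with.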
