perl-asp mailing list archives

From Arnon Weinberg <>
Subject Re: Output character encoding
Date Fri, 15 Jun 2012 04:34:16 GMT

Thanks very much, Josh, for investigating this; it saved me some time 
narrowing down the issue. Even so, I spent quite a lot of time working 
out a solution for my needs, and I don't think it is generalizable 
as-is. However, in case someone else wants to give it a crack, I provide 
details below.

On 2012-06-05 19:30, Josh Chamas wrote:
> doing this is where we have a problem:
> <% print Encode::decode('ISO-8859-1',"\xE2"); %>
> and immediately in the Apache::ASP::Response::Write() method the data 
> has already been converted incorrectly

The fact that such a simple use of Encode causes an issue is a little 
surprising. Surely others are using Apache::ASP in multi-language 
environments - is no one using Encode this way? How are others coping 
with this limitation right now?

> Its as if by merely going through the tied interface that data goes 
> through some conversion process.

Not quite, as the same results happen without a tie'd interface. The 
"use bytes" pragma is what causes the conversion (see test script below).

> Apache::ASP::Response does a "use bytes" which is to deal with the 
> output stream correctly I believe this is around content length 
> calculations.
> I think this is fine here, and turning this off makes things worse for 
> these examples.

It looks like "use bytes" is now deprecated and should indeed be 
removed. The documentation doesn't mention any trivial substitute. 
However, this pragma mostly just overrides some built-in functions with 
byte-oriented versions. So I made the following changes to
- changed use bytes => no bytes (just import the namespace)
- changed all occurrences of length() => bytes::length()
This resolved the mixed-encoding issue originally posted, but introduced 
a new (more manageable) issue.
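
The two substitutions above can be illustrated with a minimal sketch. It assumes only core Perl and the Encode module; char_and_byte_length is a hypothetical helper made up for illustration and is not part of Apache::ASP.

```perl
use strict;
use warnings;
use Encode ();
no bytes;    # loads the bytes:: namespace without enabling byte semantics

# Hypothetical helper, not part of Apache::ASP: returns the character
# count and the byte count of a string's internal representation.
sub char_and_byte_length {
    my ($string) = @_;
    return ( length($string), bytes::length($string) );
}

# A Latin-1 "a with circumflex", decoded so that its UTF-8 flag is on.
my $flagged = Encode::decode( 'ISO-8859-1', "\xE2" );
my ( $n_chars, $n_bytes ) = char_and_byte_length($flagged);
print "chars=$n_chars bytes=$n_bytes\n";    # prints "chars=1 bytes=2"
```

Note that bytes::length reports the size of the internal representation, so it matches the number of bytes actually sent only when the string is already a byte string or the output encoding is UTF-8.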

For debugging purposes, I peeked at the "UTF-8 flag" (Perl's internal 
flag indicating that a string is stored internally as UTF-8-encoded 
characters rather than raw bytes). This flag should be transparent in 
principle, but it helped make sense of the behaviour of Apache::ASP.
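
One way to peek at the flag is the core function utf8::is_utf8. The sketch below is only an illustration of that kind of check, not the exact debugging code used here; flag_state is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: report whether a string's UTF-8 flag is set.
sub flag_state {
    my ($string) = @_;
    return utf8::is_utf8($string) ? 'on' : 'off';
}

my $raw     = "\xE2";                                   # byte string
my $decoded = Encode::decode( 'ISO-8859-1', "\xE2" );   # decoded string

# The two strings compare equal as characters even though the flag differs.
printf "raw: flag %s, decoded: flag %s, equal: %s\n",
    flag_state($raw), flag_state($decoded),
    ( $raw eq $decoded ? 'yes' : 'no' );
# prints "raw: flag off, decoded: flag on, equal: yes"
```
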
Results of testing are summarized as follows:

1. Testing Perl/CGI, asp-perl, and Apache::ASP, all 3 give the same 
results with the "use bytes" pragma turned on:
- For any string with the UTF-8 flag off, output is correctly encoded.
- Any string with the flag on is (double-)encoded as UTF-8, regardless 
of the actual output encoding.
2. Testing Perl/CGI and asp-perl with "no bytes" produces correct results:
- The UTF-8 flag does not affect output - it is correctly encoded in 
every case.
- However, an interesting test case is that of the double-encoding 
problem (see This 
case is indicative of bad code, so is not a concern here, but it 
illustrates how a tie'd filehandle differs from plain STDOUT. In this 
case, a single "wide character" double-encodes the entire output (with 
buffering on, this can be the entire page), instead of just the string.
- These test cases are demonstrated by the script below.
3. Testing Apache::ASP with "no bytes" produces different results from 
the command-line (asp-perl) version, as well as different results from 
Perl/CGI running on Apache. This suggests an interaction effect between 
Apache and Apache::ASP (both are required to produce these results).
- With the UTF-8 flag off, output is correctly encoded as before.
- However, with "no bytes", Apache::ASP, and the UTF-8 flag on, the 
entire output is double-encoded. This result is similar to the 
double-encoding problem in the previous test case, except that it 
doesn't require a "wide character" - any string with the UTF-8 flag on 
will do.
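
To make "double-encoded" in the results above concrete at the byte level, here is a small standalone sketch. It is independent of Apache::ASP, and hex_bytes is a hypothetical helper used only for display.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $char = "\xE2";                              # U+00E2
my $once = Encode::encode( 'UTF-8', $char );    # correct UTF-8

# Encoding again treats each UTF-8 byte as a Latin-1 character.
my $twice = Encode::encode( 'UTF-8', $once );

print "once:  ", hex_bytes($once),  "\n";    # prints "once:  C3 A2"
print "twice: ", hex_bytes($twice), "\n";    # prints "twice: C3 83 C2 A2"
```
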

This test script demonstrates all but the last test case:


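use Encode;

foreach ( "STDOUT", "tie_use_bytes", "tie_no_bytes" )
{
  print "$_: ";
  my $STDOUT;
  tie *FH, $_ if ! /^S/;
  $STDOUT = select ( FH ) if ! /^S/;
  # The 2nd and 3rd arguments were lost in the archive; they are
  # reconstructed here from the output below (an assumption):
  # a UTF-8-flagged "\xE2" followed by a plain byte-string "\xE2".
  print "\x{263a}", Encode::decode ( 'ISO-8859-1', "\xE2" ), "\xE2";
  print "\n";
  close ( FH ) if ! /^S/;
  select ( $STDOUT ) if ! /^S/;
}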

use strict;

package tie_use_bytes;
use bytes;

sub TIEHANDLE { bless {}, shift; }
sub PRINT { shift()->{out} .= join ( $,, @_ ); }
sub CLOSE { print STDOUT delete ( shift()->{out} ); }

package tie_no_bytes;
no bytes;

sub TIEHANDLE { bless {}, shift; }
sub PRINT { shift()->{out} .= join ( $,, @_ ); }
sub CLOSE { print STDOUT delete ( shift()->{out} ); }

# Output: ##################

Wide character in print at ...
STDOUT: ☺ââ
  # STDOUT output is correct in all cases
tie_use_bytes: ☺ââ
  # with "use bytes", the UTF-8-flagged 2nd character is double-encoded
Wide character in print at ...
tie_no_bytes: ☺ââ
  # with "no bytes", the output is correct, but a "wide character"
  # double-encodes the entire string because of the way the tie'd
  # file handle is implemented


By the way, if it's getting difficult to wrap your head around this, 
you're not alone.

At this point, I peeked at the $Response->{out} data buffer, and could 
see that it was encoded correctly. However, the output from Apache (when 
the UTF-8 flag is on) was not correct, suggesting that Apache is doing 
something to encode the string in this case.
I decided therefore to address the problem by turning off the UTF-8 
flag. The most fault-tolerant method I managed to come up with to do 
this was the following:

= Encode::encode ( 'ISO-8859-1', ${$Response->{BinaryRef}},
sub{ Encode::encode ( 'UTF-8', chr ( shift() ) ) } )
if ! grep ( /^utf8$/, PerlIO::get_layers ( STDOUT ) );

which can go at the top of the $Response->Flush() method, or in 

With this solution I can now modify Apache::ASP's output encoding (e.g., 
using binmode ( STDOUT );), as originally desired, and the output 
appears correct in all my test cases.
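
For anyone who wants to experiment with the flag-clearing step outside Apache::ASP, here is a standalone sketch of the same idea applied to a plain scalar. The helper names has_utf8_layer and to_output_bytes are hypothetical, and the sketch makes no assumption about where it would be hooked into Apache::ASP.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if the given handle has a "utf8" layer.
sub has_utf8_layer {
    my ($fh) = @_;
    return scalar grep { $_ eq 'utf8' } PerlIO::get_layers($fh);
}

# Hypothetical helper: turn a character string into a byte string
# (UTF-8 flag off). Characters that fit in ISO-8859-1 become single
# bytes; anything else falls back to its UTF-8 byte sequence.
sub to_output_bytes {
    my ($string) = @_;
    return Encode::encode( 'ISO-8859-1', $string,
        sub { Encode::encode( 'UTF-8', chr( shift() ) ) } );
}

my $page = Encode::decode( 'ISO-8859-1', "\xE2" );    # UTF-8 flag on
$page = to_output_bytes($page) if !has_utf8_layer( \*STDOUT );
print utf8::is_utf8($page) ? "flag on\n" : "flag off\n";
```

The coderef passed as the third argument to Encode::encode is called with the ordinal of each character that ISO-8859-1 cannot represent, so such characters are emitted as UTF-8 bytes instead of being replaced or causing an error.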

Arnon Weinberg
