cocoon-users mailing list archives

From Christopher Schultz <ch...@christopherschultz.net>
Subject Getting UTF-16 encoding on dynamic content regardless of output content type
Date Tue, 30 Oct 2018 15:58:47 GMT

All,

I'm attempting to do everything with UTF-8 in Cocoon 2.1.11. I have a
servlet generating XML in UTF-8 encoding and I have a pipeline with a
few transforms in it, ultimately serializing to XHTML.

If I have a Unicode character in the XML which is outside of the BMP,
such as this one: 🇺🇸  (that's an American flag, in case your mail
reader doesn't render it correctly), then I end up getting a series of
bytes coming from Cocoon after the transform that look like UTF-16.

Here's what's in the XML:

<first-name>Test🇺🇸</first-name>

Just like that. The bytes in the message for the flag character are:

f0  9f  87  ba  f0  9f  87  b8
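For anyone who wants to double-check those bytes, here is a minimal sketch in plain Java (nothing Cocoon-specific). The class name `FlagBytes` and the method `utf8Hex` are made up for illustration; the flag is written with `\u` escapes so the example does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

final class FlagBytes {
    // Encodes the string as UTF-8 and returns the bytes as lowercase hex,
    // separated by two spaces (same layout as the dump above).
    static String utf8Hex(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (out.length() > 0) {
                out.append("  ");
            }
            out.append(String.format("%02x", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F1FA followed by U+1F1F8, written as UTF-16 surrogate pairs.
        String flag = "\uD83C\uDDFA\uD83C\uDDF8";
        System.out.println(utf8Hex(flag)); // f0  9f  87  ba  f0  9f  87  b8
    }
}
```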

When rendering that into XHTML, I'm getting this in the output:

Test&#55356;&#56826;&#55356;&#56824;

The American flag in Unicode reference can be found here:
https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%87%BA%F0%9F%87%B8

You can see it broken down a bit better here for "Regional U":
http://www.fileformat.info/info/unicode/char/1f1fa/index.htm

and "Regional S":
http://www.fileformat.info/info/unicode/char/1f1f8/index.htm

What's happening is that some component in Cocoon has decided to
generate numeric HTML entities instead of just emitting the character.
That's okay IMO. But the values it emits don't make sense for a UTF-8
output encoding.

The first two entities "&#55356;&#56826;" are the decimal values of the
two UTF-16 code units (the surrogate pair 0xD83C 0xDDFA) for "Regional
Indicator Symbol Letter U", and they are correct... for UTF-16. If I
change the output encoding from UTF-8 to UTF-16, then the browser will
render these correctly. Using UTF-8, they show as four of those ugly [?]
characters on the screen.
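The difference between the two sets of numbers can be reproduced in plain Java. This is only a sketch of the arithmetic, not the code Cocoon actually runs; the class `SurrogateDemo` and both method names are hypothetical. Walking the string one `char` at a time yields the UTF-16 code units seen in the output, while walking it by code point (Java 5 and later) yields the values a UTF-8 document needs.

```java
final class SurrogateDemo {
    // One decimal reference per UTF-16 code unit (Java char).
    // This reproduces the output described above.
    static String perCodeUnit(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            out.append("&#").append((int) s.charAt(i)).append(';');
        }
        return out.toString();
    }

    // One decimal reference per Unicode code point.
    // A surrogate pair is combined into a single value.
    static String perCodePoint(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.append("&#").append(cp).append(';');
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F1FA followed by U+1F1F8, written as UTF-16 surrogate pairs.
        String flag = "\uD83C\uDDFA\uD83C\uDDF8";
        System.out.println(perCodeUnit(flag));  // &#55356;&#56826;&#55356;&#56824;
        System.out.println(perCodePoint(flag)); // &#127482;&#127480;
    }
}
```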

I had originally just decided to throw up my hands and use UTF-16
encoding even though it's dumb. But it seems that MSIE cannot be
convinced to use UTF-16 no matter what, and I must continue to support
MSIE. :(

So it's back to UTF-8 for me.

How can I get Cocoon to output that character (or "those characters")
correctly?

It needs to be one of the following:

&#127482;&#127480;             (HTML decimal entities)
&#x1f1fa;&#x1f1f8;             (HTML hex entities)
f0  9f  87  ba  f0  9f  87  b8 (raw UTF-8 bytes)

Does anyone know how/where this conversion is being performed in
Cocoon? Probably in an XHTML serializer (I'm using
org.apache.cocoon.serialization.XMLSerializer). I'm using mime-type
"text/html" and <encoding>UTF-8</encoding> in my sitemap for that
serializer (the one named "xhtml"). I believe I've made very few
changes from the default, if any.

I haven't yet figured out how to get from what Java sees (the surrogate
pair \uD83C\uDDF8 for the "S", for example) to &#x1f1f8;, but knowing
where the code is that is making that decision would be very helpful.
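For reference, the conversion itself is small once the surrogate pair is recombined. The following is a minimal sketch under stated assumptions: `CodePointEscaper` and `escapeNonAscii` are hypothetical names, this is not Cocoon's serializer code, and it only handles non-ASCII characters (escaping of `<`, `>` and `&` is left out). It uses `Character.toCodePoint`, which is standard Java 5+.

```java
final class CodePointEscaper {
    // Leaves ASCII untouched and writes every other character as a hex
    // numeric reference, combining surrogate pairs into one code point.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            boolean pair = Character.isHighSurrogate(c)
                    && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1));
            if (pair) {
                // e.g. 0xD83C + 0xDDF8 -> 0x1F1F8
                int cp = Character.toCodePoint(c, text.charAt(i + 1));
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
                i += 2;
            } else if (c > 0x7F) {
                // BMP character outside ASCII. An unpaired surrogate would
                // also land here; a real serializer should reject it.
                out.append("&#x").append(Integer.toHexString(c)).append(';');
                i++;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String name = "Test\uD83C\uDDFA\uD83C\uDDF8";
        System.out.println(escapeNonAscii(name)); // Test&#x1f1fa;&#x1f1f8;
    }
}
```

With a UTF-8 output encoding the escaping step could be skipped altogether, since the encoder can write the four-byte sequences directly; the sketch only shows how the hex form would be produced.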

Any ideas?

-chris

