tapestry-dev mailing list archives

From "Mind Bridge" <mindbridge...@yahoo.com>
Subject Charsets/Encoding
Date Mon, 02 Jun 2003 12:38:05 GMT
Hi all,

	I am wrapping up one item now, so I thought this might be a good time to
launch a debate about charsets and encoding as well (I hope the impending
beta is not a problem).


	There are a number of places in the code where conversions occur from char
to byte and back (basically from Java's Unicode to another charset and
back). In most cases at the moment this conversion uses the JVM's default
charset. My suggestion is to make it possible to control what charsets are
applied using Tapestry's property mechanism.
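
	To illustrate the difference between relying on the JVM default and naming the charset, here is a minimal sketch. It is not Tapestry code; the class and method names are made up for illustration, and it assumes a modern JDK (String.getBytes(Charset) did not exist in older releases).

```java
import java.nio.charset.Charset;

public class ExplicitCharsetDemo {
    // Encode with a named charset rather than the JVM default.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    // Decode with a named charset rather than the JVM default.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an accented e
        // text.getBytes() with no argument would depend on the JVM default.
        System.out.println(encode(text, "UTF-8").length);      // 5
        System.out.println(encode(text, "ISO-8859-1").length); // 4
        // A round trip through the same charset restores the text.
        System.out.println(text.equals(decode(encode(text, "UTF-8"), "UTF-8"))); // true
    }
}
```

	The same string produces different bytes depending on the charset, which is why leaving the choice to the JVM default makes the result machine-dependent.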


DECOUPLING OF CONVERSIONS:

	Before that, however, I would like to point out that there are TWO separate
operations that perform such a conversion:

	1) Reading the page/component template (currently using the JVM's default
charset)

	2) Writing the response to the browser and reading the request (currently
using the charset defined during the creation of HTMLWriter, UTF-8 by
default)

	Please note that there is no need to couple those together -- the charsets
used can be very different.

	A number of people on the list have pointed out that they use Big5 or some
other custom encoding and that they would not be happy with standardization
of Tapestry's encoding. Do note, however, that this applies to (1) only.
Having the ability to easily define what charset is used in a given template
or property file allows people to use the charset their editor uses (e.g.
Big5 or KOI8-R) without a problem.
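
	The decoupling of (1) and (2) can be sketched roughly as follows. This is only an illustration, not how Tapestry does it: the class and method names are hypothetical, and ISO-8859-1 stands in for the template charset because every JVM is guaranteed to support it (Big5 or KOI8-R would work the same way where the JVM provides them).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecoupledCharsets {
    // Conversion (1): decode template bytes using the template's own charset.
    static String readTemplate(InputStream in, Charset templateCharset) throws IOException {
        Reader reader = new InputStreamReader(in, templateCharset);
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        int count;
        while ((count = reader.read(buffer)) != -1) {
            text.append(buffer, 0, count);
        }
        return text.toString();
    }

    // Conversion (2): encode the rendered markup using the response charset.
    static byte[] writeResponse(String markup, Charset responseCharset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(out, responseCharset);
        writer.write(markup);
        writer.flush();
        return out.toByteArray();
    }

    // Convenience for in-memory data: runs both conversions back to back.
    static byte[] transcode(byte[] templateBytes, Charset templateCharset, Charset responseCharset) {
        try {
            String markup = readTemplate(new ByteArrayInputStream(templateBytes), templateCharset);
            return writeResponse(markup, responseCharset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a template file saved by an editor in ISO-8859-1.
        byte[] templateFile = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);
        byte[] response = transcode(templateFile, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(templateFile.length + " -> " + response.length); // 11 -> 12
    }
}
```

	Once the template has been decoded into Java chars, nothing about its original charset carries over to the response side.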

	Conversion (2), on the other hand, does not seem to matter as much to the
designer/developer. It does not directly affect them, and the only thing
that matters is whether the browser on the other end of the connection
supports the charset used. As far as I am aware, pretty much all browsers
(including the WML ones) support UTF-8. Am I correct in stating that?

	Having the response charset clearly defined in the page/component/app/etc.
carries certain benefits as well, the most important of which is
performance. According to the profiling done by Luis Neves, Tapestry is
slowed down considerably by the constant char -> byte conversion when
generating the response (mostly in PrintWriter). Knowing the charset ahead
of time would allow doing the conversion for the static text (template) once,
at page load, and greatly improve performance.
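
	The idea of converting static text once can be sketched like this. It is a simplified illustration under my own assumptions (a template reduced to one static prefix, one dynamic value, and one static suffix; hypothetical class name), not a proposal for the actual implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PreEncodedTemplate {
    private final Charset responseCharset;
    private final byte[] prefix; // static text, encoded once at load time
    private final byte[] suffix; // static text, encoded once at load time

    PreEncodedTemplate(String prefixText, String suffixText, Charset responseCharset) {
        this.responseCharset = responseCharset;
        this.prefix = prefixText.getBytes(responseCharset);
        this.suffix = suffixText.getBytes(responseCharset);
    }

    // Per request: only the dynamic value goes through char -> byte conversion.
    void render(String dynamicValue, OutputStream out) throws IOException {
        out.write(prefix);
        out.write(dynamicValue.getBytes(responseCharset));
        out.write(suffix);
    }

    // Same as render, but into memory, so no checked exception escapes.
    byte[] renderToBytes(String dynamicValue) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] value = dynamicValue.getBytes(responseCharset);
        out.write(prefix, 0, prefix.length);
        out.write(value, 0, value.length);
        out.write(suffix, 0, suffix.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        PreEncodedTemplate page = new PreEncodedTemplate("<b>", "</b>", StandardCharsets.UTF_8);
        System.out.println(page.renderToBytes("\u00e9").length); // 9
    }
}
```

	This only works if the response charset is known when the template is loaded, which is the point of defining it up front.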


DEFINITION:

	I think the charset to be used can be defined in a property. The following
search order can be used:

	.page/.jwc
	.library
	.application
	servlet
	context
	JVM

	(the bottom 4 are the standard way to define global Tapestry properties)

	There should be one property for the templates/properties charset
(defaulting to the default JVM charset) and another for the response charset
(defaulting to UTF-8).
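
	The search order and the two defaults could look roughly like this. The property names "template-charset" and "response-charset" are placeholders I made up for the sketch, and plain maps stand in for the real property sources.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CharsetPropertyLookup {
    // Hypothetical property names, chosen for this sketch only.
    static final String TEMPLATE_CHARSET = "template-charset";
    static final String RESPONSE_CHARSET = "response-charset";

    // Returns the value from the first scope that defines the property,
    // or the fallback if no scope defines it.
    static String resolve(String name, List<Map<String, String>> scopes, String fallback) {
        for (Map<String, String> scope : scopes) {
            String value = scope.get(name);
            if (value != null) {
                return value;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Map<String, String> page = new HashMap<>();
        Map<String, String> library = new HashMap<>();
        Map<String, String> application = new HashMap<>();
        Map<String, String> servlet = new HashMap<>();
        Map<String, String> context = new HashMap<>();
        Map<String, String> jvm = new HashMap<>();
        page.put(TEMPLATE_CHARSET, "Big5");
        application.put(TEMPLATE_CHARSET, "ISO-8859-1");

        // Most specific scope first, matching the search order above.
        List<Map<String, String>> scopes =
                Arrays.asList(page, library, application, servlet, context, jvm);

        // The page-level value wins over the application-level one.
        System.out.println(resolve(TEMPLATE_CHARSET, scopes, Charset.defaultCharset().name())); // Big5
        // Nothing defines the response charset, so the default applies.
        System.out.println(resolve(RESPONSE_CHARSET, scopes, "UTF-8")); // UTF-8
    }
}
```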

	Is there any reason not to have the response charset default to UTF-8? This
would allow everything, including a mix of Latin, Cyrillic, Chinese, and
Greek, to appear on the page.

OTHER ITEMS:

	To get the entire system to work well with UTF-8 in the request/response
cycle, a few other small changes need to be made (this is also described on
the wiki under the 'Enabling Unicode' topic -- very helpful):

	1) Make the Form component have enctype='multipart/form-data' by default
(with the ability to override, of course).

	2) Modify the Shell component to include
		<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
	by default (the content type, including charset, should be the content type
used by the HTML writer, of course). This is needed by some browsers (e.g.
IE).

	Is there a reason NOT to do these things?
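
	For item 2, the point is that the meta tag should be derived from whatever content type and charset the writer uses rather than hard-coded. A minimal sketch, with a hypothetical helper that is not part of the Shell component:

```java
public class ContentTypeMeta {
    // Builds the meta tag from the same MIME type and charset the writer uses,
    // so the declared charset cannot drift from the actual response encoding.
    static String metaTag(String mimeType, String charset) {
        return "<meta http-equiv=\"Content-Type\" content=\""
                + mimeType + "; charset=" + charset + "\">";
    }

    public static void main(String[] args) {
        System.out.println(metaTag("text/html", "utf-8"));
    }
}
```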


	That's it. Hopefully this should resolve all encoding issues.

Best regards,
-mb


