tapestry-dev mailing list archives

From "worookie" <woroo...@hotmail.com>
Subject Re: Charsets/Encoding
Date Thu, 12 Jun 2003 20:36:43 GMT
First of all, Thanks MB for answering my questions (posted in the user

You have got a very good architecture view to improve Tapestry with its

My only concern is whether or not UTF-8 can handle all Chinese
As far as I know, the Unicode committee is still working on including all
Chinese characters (of which there are at least 100,000).
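
On the encoding side of this concern: UTF-8 can represent every code point that Unicode assigns, so any limit comes from Unicode's coverage rather than from UTF-8 itself. A minimal sketch, assuming a modern JVM (Java 7 or later for StandardCharsets); the class and method names are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

final class Utf8CjkSketch {
    // Encodes one code point as UTF-8 and checks that decoding gives it back.
    static boolean roundTrips(int codePoint) {
        String original = new String(Character.toChars(codePoint));
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8).equals(original);
    }

    // Number of bytes UTF-8 needs for one code point.
    static int utf8Length(int codePoint) {
        return new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+4E2D is a common ideograph; U+20000 lies outside the Basic
        // Multilingual Plane (CJK Unified Ideographs Extension B).
        System.out.println(roundTrips(0x4E2D) + ", " + utf8Length(0x4E2D) + " bytes");
        System.out.println(roundTrips(0x20000) + ", " + utf8Length(0x20000) + " bytes");
    }
}
```

A character that Unicode has not yet assigned cannot be represented in any Unicode encoding, so that part of the concern is independent of the choice of UTF-8.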

Anyway, I can't offer a better way than your proposed approach.
So, please let me know if you need any testers or assistants in this field.

Thanks for all your efforts!



From: Mind Bridge <mindbridgeweb@yahoo.com>
Subject: Charsets/Encoding
Date: Mon, 2 Jun 2003 15:38:05 +0300
Content-Type: text/plain;

Hi all,

 I am wrapping up one item now, so I thought it may be a good time to launch
a debate about this as well (I hope the impending beta is not a problem).

 There are a number of places in the code where conversions occur from char
to byte and back (basically from Java's Unicode to another charset and
back). In most cases at the moment this conversion uses the JVM's default
charset. My suggestion is to make it possible to control what charsets are
applied using Tapestry's property mechanism.
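
To illustrate the difference between relying on the JVM default and naming the charset explicitly, here is a minimal sketch. It is not Tapestry code; the class and method names are hypothetical, and it assumes Java 7 or later for StandardCharsets:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetConversionSketch {
    // char -> byte with an explicitly chosen charset (not the JVM default).
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // byte -> char with an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café", written with an escape to stay ASCII-safe

        // The same text produces different bytes under different charsets.
        System.out.println(encode(text, StandardCharsets.UTF_8).length);      // 5
        System.out.println(encode(text, StandardCharsets.ISO_8859_1).length); // 4

        // text.getBytes() with no argument would depend on the JVM default,
        // so the result could differ from one machine to another.
    }
}
```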


 Before that, however, I would like to point out that there are TWO separate
operations that perform such a conversion:

 1) Reading the page/component template (currently using the JVM's default

 2) Writing the response to the browser and reading the request (currently
using the charset defined during the creation of HTMLWriter, utf-8 by

 Please note that there is no need to couple those together -- the charsets
used can be very different.

 A number of people on the list have pointed out that they use Big5 or some
other custom encoding and that they would not be happy with standardization
of Tapestry's encoding. Do note, however, that this applies to (1) only.
Having the ability to easily define what charset is used in a given template
or property file allows people to use the charset their editor uses (e.g.
Big5 or KOI8-R) without a problem.
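
For conversion (1), the idea could look roughly like the sketch below: the template bytes are decoded with a configured charset name, falling back to the JVM default when none is set. This is only an illustration under my own assumptions; readTemplate and its parameters are hypothetical and not part of Tapestry:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TemplateReaderSketch {
    // Decodes a template stream with the given charset name;
    // a null name means "use the JVM default", as happens today.
    static String readTemplate(InputStream in, String charsetName) {
        Charset charset = (charsetName == null)
                ? Charset.defaultCharset()
                : Charset.forName(charsetName);
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Simulate a template saved by an editor that uses ISO-8859-1.
        byte[] onDisk = "<span>caf\u00e9</span>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readTemplate(new ByteArrayInputStream(onDisk), "ISO-8859-1"));
    }
}
```

The same call with "Big5" or "KOI8-R" would work wherever the JVM ships those charsets; Charset.forName throws UnsupportedCharsetException otherwise.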

 Conversion (2), on the other hand, does not seem to matter much to the
designer/developer. It does not directly affect them, and
the only thing that matters is whether the browser on the other end of the
connection supports the charset used. As far as I am aware, pretty much all
browsers (including the WML ones) support UTF-8. Am I correct in stating

 Having the response charset clearly defined in the page/component/app/etc.
carries certain benefits as well, the most important of which is
performance. According to the profiling done by Luis Neves, Tapestry is
slowed considerably by the constant char -> byte conversion when generating
the response (mostly in PrintWriter). Knowing the charset ahead of time
would allow the static text (template) to be converted once at page load,
which should greatly improve performance.
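
A rough sketch of the pre-encoding idea, assuming the response charset is known when the template is loaded: static chunks are converted to bytes once, and only dynamic values are encoded per request. The class below is hypothetical and much simpler than Tapestry's real rendering pipeline:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PreEncodedTemplateSketch {
    private final Charset charset;
    private final byte[][] staticChunks; // encoded once, at "page load"

    PreEncodedTemplateSketch(String[] staticText, Charset charset) {
        this.charset = charset;
        this.staticChunks = new byte[staticText.length][];
        for (int i = 0; i < staticText.length; i++) {
            staticChunks[i] = staticText[i].getBytes(charset);
        }
    }

    // Writes static0, dynamic0, static1, dynamic1, ... to the stream.
    // Only the dynamic values go through char -> byte conversion here.
    void render(OutputStream out, String... dynamicValues) throws IOException {
        for (int i = 0; i < staticChunks.length; i++) {
            out.write(staticChunks[i]);
            if (i < dynamicValues.length) {
                out.write(dynamicValues[i].getBytes(charset));
            }
        }
    }

    // Convenience wrapper that renders into memory.
    byte[] renderToBytes(String... dynamicValues) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            render(out, dynamicValues);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        PreEncodedTemplateSketch page = new PreEncodedTemplateSketch(
                new String[] {"<p>Hello, ", "!</p>"}, StandardCharsets.UTF_8);
        System.out.println(new String(page.renderToBytes("MB"), StandardCharsets.UTF_8));
    }
}
```

Whether this actually pays off would need to be measured against the profiling mentioned above; the sketch only shows where the conversion work moves.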


 I think the charset to be used can be defined in a property. The following
search order can be used:


 (the bottom 4 are the standard way to define global Tapestry properties)

 There should be one property for the templates/properties charset
(defaulting to the default JVM charset) and another for the response charset
(defaulting to utf-8).
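
The lookup with defaults might be sketched as follows. The property names ("template.charset", "response.charset") and the representation of each property source as a Map are my own placeholders; the sketch does not reproduce Tapestry's actual property sources or their order:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class CharsetPropertySketch {
    // Walks the sources in search order and returns the first value found,
    // or the fallback when no source defines the property.
    static String resolve(String name, List<Map<String, String>> sources, String fallback) {
        for (Map<String, String> source : sources) {
            String value = source.get(name);
            if (value != null) {
                return value;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Two hypothetical sources, most specific first.
        Map<String, String> specific = Collections.singletonMap("template.charset", "Big5");
        Map<String, String> global = Collections.<String, String>emptyMap();
        List<Map<String, String>> sources = Arrays.asList(specific, global);

        // The template charset is set in a source; the response charset falls back.
        System.out.println(resolve("template.charset", sources, "JVM default"));
        System.out.println(resolve("response.charset", sources, "utf-8"));
    }
}
```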

 Is there any reason not to have the response charset default to utf-8? This
would allow everything, including a mix of Latin, Cyrillic, Chinese, and
Greek to appear on the page.


 To get the entire system to work well with UTF-8 in the request/response
cycle, a few other small changes need to be made (this is also described on
the wiki under the 'Enabling Unicode' topic -- very helpful):

 1) Make the Form component have enctype='multipart/form-data' by default
(with ability to override, of course).

 2) Modify the Shell component to include
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
 by default (the content type, including charset, should be the content type
used by the HTML writer, of course). This is needed by some browsers (e.g.

 Is there a reason NOT to do these things?

 That's it. Hopefully this should resolve all encoding issues.

Best regards,
