cocoon-users mailing list archives

From Pier Fumagalli
Subject Re: [Help]How can I use non-ascii file name?
Date Mon, 16 Aug 2004 13:06:07 GMT
References to non-hack:


On 16 Aug 2004, at 14:02, Pier Fumagalli wrote:

> Ok, I tracked the sucker down... It's the servlet container... They  
> all decode the stupid URL using ISO-8859-1... And therefore, utterly  
> incompatible with 3/4 of the non-English-speaking world...
> At best, I was able to _HACK_ the whole thing through, by getting the  
> path info in this way:
> <WARNING note="shit-code-follows">
> new String(request.getPathInfo().getBytes("ISO-8859-1"), "UTF-8");
> Therefore, I get the BYTES of the path-info string as if they were in  
> ISO-8859-1, and re-create a new string by taking those bytes and  
> forcing them to be in UTF-8...
> Niiiiiiiiiiiiiiiiiiice!
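
The round trip described above can be tried outside a servlet container. This is only a minimal sketch: the class and method names (`PathInfoRecode`, `recode`) are made up for illustration, and the input string stands in for what `getPathInfo()` would return if the container decoded the UTF-8 bytes of the request seen later in this thread as ISO-8859-1. Non-ASCII characters are written as `\u` escapes so the source compiles regardless of file encoding.

```java
import java.nio.charset.StandardCharsets;

public class PathInfoRecode {
    // Hypothetical helper: take a string that was wrongly decoded as
    // ISO-8859-1, recover the original bytes, and decode them as UTF-8.
    static String recode(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Bytes E8 B0 B7 E7 90 86 E5 AD 90, each read as one ISO-8859-1 char.
        String fromContainer =
            "\u00e8\u00b0\u00b7\u00e7\u0090\u0086\u00e5\u00ad\u0090";
        String fixed = recode(fromContainer);
        // The intended three-character name, written with escapes.
        System.out.println(fixed.equals("\u8c37\u7406\u5b50")); // prints true
    }
}
```

The trick only works because ISO-8859-1 maps every byte value to exactly one character, so no information is lost in the wrong decoding step. With a container that used a lossy default charset, the original bytes could not be recovered this way.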
> Note that this stupidity also happens with accented letters (that for  
> us Italians is a big p-i-t-a).
> I'll see why this happens in Jetty, I'll poke Jen and Greg to have  
> either a fix, or an explanation and workaround... For now, brrrr, I  
> think that the hack is the only way to go...
> Oh, I checked it also on Tomcat. Same problem there as well...
> 	Pier
> On 16 Aug 2004, at 12:05, Marc Portier wrote:
>> Pier,
>> As a coincidence we recently (last week) had a similar post on  
>> xreporter-list (which uses cocoon)
>> Bad news is that I didn't track it down to the bottom yet, just some  
>> findings below:
>> (in fact the odd-char-in-filename for map:read and map:mount was one  
>> of the first things I was going to test, seems I'm already presented  
>> with the results)
>> what I did find already was this:
>> Cocoon's Request.getSitemapURI() will return an assembly of  
>> javax.servlet.http.HttpServletRequest.getServletPath()
>> + javax.servlet.http.HttpServletRequest.getPathInfo()
>> The servlet spec states that these will be (URL-)decoded.
>> Thus 3 char sequences of the kind "%BYTE_HEX" will have been  
>> translated into single bytes. The obtained byte-sequence is then  
>> decoded using SOME_DECODING (my guess would be using ISO-8859-1, but  
>> haven't found yet if this is container specific, modifiable or hard  
>> noted in some spec. Only thing I found is this:  
>>, but  
>> I'm yet unsure on how this influences servlet specs, or actual  
>> container and even browser implementations for that matter)
>> Alternatively there is:
>> Cocoon's Request.getRequestURI() which maps onto the
>> javax.servlet.http.HttpServletRequest.getRequestURI()
>> This one resembles the URI as transferred over the wire: ie. not  
>> (url-)decoded, or in other words still holding the %XX sequences
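
To make the difference between the raw and the decoded form concrete, here is a small sketch that decodes the same percent-encoded path with two charsets. It uses `java.net.URLDecoder` purely as a stand-in for whatever the container does internally, which is an assumption; `URLDecoder` also turns `+` into a space (form decoding), so it is not an exact model of path decoding. The class name `DecodeCompare` is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodeCompare {
    // Decode %XX sequences into bytes, then bytes into chars with `charset`.
    static String decode(String rawPath, String charset) {
        try {
            return URLDecoder.decode(rawPath, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // The raw path as it appears in the access log quoted in this thread.
        String raw = "/%e8%b0%b7%e7%90%86%e5%ad%90";
        String asUtf8 = decode(raw, "UTF-8");
        String asLatin1 = decode(raw, "ISO-8859-1");
        System.out.println(asUtf8.length());   // 4: slash plus three characters
        System.out.println(asLatin1.length()); // 10: slash plus nine characters
    }
}
```

The same nine bytes become three characters under UTF-8 and nine under ISO-8859-1, which is why a matcher written with the real characters never sees a match.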
>> As an extra clarification on all these the servlet spec explicitly  
>> states: (2.3 version, page 34, section SRV4.4 Request Path Elements)
>> <quote>
>> It is important to note that, *except for URL encoding differences*  
>> between the request URI and the path parts, the following equation is  
>> always true:
>> requestURI = contextPath + servletPath + pathInfo
>> </quote>
>> I (for now) assume that this is the same encoding we expect  
>> cocoon-deploy people to specify in the 'container-encoding'  
>> init-parameter in the web.xml (allowing to correctly en-re-decode  
>> request-parameter-values in case of mismatching form and container  
>> encodings)
>> Ok, above is dull data, and not much into a direction of any solution  
>> yet.  My current feeling (long shot, needs time to test and try, and  
>> based on above assumption) is that we should
>> In terms of backwards compatibility I'm unsure if we could just go  
>> about changing the semantics (historically implied use of iso-8859-1  
>> encoding) of getSitemapURI() or rather should deprecate and/or have a  
>> different method next to it?
>> In any case this new implementation should then probably apply the  
>> same kind of dirty en-re-decoding-trick
>> return new String(getSitemapURI().getBytes(container_encoding), form_encoding);
>> as we do today with the request param values?
>> (see  
>> cocoon/environment/http/
>> sorry for the old cvs-style link, the svn version of viewcvs doesn't  
>> seem to support 'annotate' ?)
>> For the record: the fast hack/workaround in the xreporter case was  
>> exactly to apply this.
>> Attached to this I'm also seeing the trouble of mount-points in  
>> cocoon.   I've seen a number of installments needing (well, 'using'  
>> at least) some insertion of that  
>> part-of-the-URL-that-maps-to-the-mounted-sitemap to be able to have  
>> links in source xml.files refer to other resources managed by the  
>> same mounted sitemap without the need to explicitly mention that  
>> part (but have it dynamically inserted by some xsl instead).
>> In those occasions I've seen people mostly subtract siteMapURI from  
>> requestURI to obtain that prefix part. Regarding the above  
>> observations this algorithm will however fail due to encoding  
>> differences.
>> My proposal would be to not only add a method for decoding the  
>> sitemapURI properly, but at the same time to add a convenience  
>> method to return the mounted-sitemap-part as well on the level of  
>> cocoon's request.
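
As a rough illustration of why the naive subtraction fails and what such a convenience method might do, here is a sketch under stated assumptions. `mountPrefix` is a hypothetical helper, not part of Cocoon's `Request` API; the example path `/app/mounted/` is invented; and `URLDecoder` is again only a stand-in for the container's own decoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class MountPrefix {
    // Hypothetical helper: decode the raw request URI with the same charset
    // the container used, then strip the (already decoded) sitemap URI.
    static String mountPrefix(String requestURI, String sitemapURI,
                              String containerEncoding) {
        String decoded;
        try {
            decoded = URLDecoder.decode(requestURI, containerEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
        if (!decoded.endsWith(sitemapURI)) {
            throw new IllegalArgumentException(
                "sitemap URI is not a suffix of the decoded request URI");
        }
        return decoded.substring(0, decoded.length() - sitemapURI.length());
    }

    public static void main(String[] args) {
        String requestURI = "/app/mounted/%e8%b0%b7";  // raw, still %-encoded
        String sitemapURI = "\u00e8\u00b0\u00b7";      // decoded as ISO-8859-1
        // Naive string subtraction cannot work: the suffix does not match.
        System.out.println(requestURI.endsWith(sitemapURI)); // prints false
        System.out.println(mountPrefix(requestURI, sitemapURI, "ISO-8859-1"));
        // prints /app/mounted/
    }
}
```

Note that the prefix comes back in decoded form, so a prefix containing non-ASCII characters would still need the same re-encoding treatment as the sitemap URI itself.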
>> Above are early observations that need some backing, so comments  
>> welcome. (and hoping someone beats me to this since I'm lacking the  
>> time to pursue myself)
>> -marc=
>> Pier Fumagalli wrote:
>>> On 12 Aug 2004, at 12:45, roy huang wrote:
>>>> Hi, all:
>>>>     Using a reader to display a jpg or gif is quite simple, like:
>>>>    <map:match pattern="*.jpg">
>>>>     <map:read mime-type="image/jpeg" src="jpg/{1}.jpg" />
>>>>    </map:match>
>>>>    But if the file name is not ASCII but UTF-8 or another encoding,  
>>>> like 花.jpg (simplified Chinese), the resolver doesn't resolve the  
>>>> name correctly and an error occurs:
>>>> org.apache.cocoon.ResourceNotFoundException: Error during resolving  
>>>> of the input stream:  
>>>> org.apache.excalibur.source.SourceNotFoundException: file:/C:/My  
>>>> Documents/IBM/wsad/workspace/PowerOA/WebContent/test/jpg/花.jpg  
>>>> doesn't exist.
>>>> How can I use a non-ASCII file name in Cocoon? I can't find any  
>>>> description or help in the wiki or the archived mailing list.
>>>> Roy Huang
>>> It appears indeed as a bug...
>>> I have this sitemap snippet:
>>>     <map:match pattern="谷*">
>>>       <map:generate src="谷{1}.xml"/>
>>>       <map:transform src="welcome.xslt">
>>>         <map:parameter name="contextPath"  
>>> value="{request:contextPath}"/>
>>>       </map:transform>
>>>       <map:serialize type="xhtml"/>
>>>     </map:match>
>>> and a file on the disk called "谷理子.xml". Somewhere, when I make a  
>>> request for "http://localhost:8888/谷理子", the whole thing goes  
>>> berserk...
>>> Now, the URL is passed correctly, as I see that in the access log:
>>> INFO    (2004-08-16) 10:26.36:538   [access]  
>>> (/%e8%b0%b7%e7%90%86%e5%ad%90) main-3/CocoonServlet: '????????'  
>>> Processed by Apache Cocoon 2.1.5 in 27 milliseconds.
>>> The above-mentioned string's encoding in UTF-8 is, in fact, "E8 B0  
>>> B7 E7 90 86 E5 AD 90", so, cocoon receives it correctly, but somehow  
>>> it gets lost in the process.
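
The byte sequence quoted above is easy to double-check. A minimal sketch, with a hypothetical class name `Utf8Hex` and the three characters written as `\u` escapes so the source is encoding-independent:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Hex {
    // Render the UTF-8 encoding of a string as space-separated hex bytes.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+8C37 U+7406 U+5B50, the file name used in the request above.
        System.out.println(hex("\u8c37\u7406\u5b50"));
        // prints E8 B0 B7 E7 90 86 E5 AD 90
    }
}
```

The output matches the `%e8%b0%b7%e7%90%86%e5%ad%90` seen in the access log, which supports the claim that the request arrives intact and is damaged only when it is decoded.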
>>> Now, if I modify my sitemap to
>>>     <map:match pattern="tanisatoko">
>>>       <map:generate src="谷理子.xml"/>
>>>       <map:transform src="welcome.xslt">
>>>         <map:parameter name="contextPath"  
>>> value="{request:contextPath}"/>
>>>       </map:transform>
>>>       <map:serialize type="xhtml"/>
>>>     </map:match>
>>> And I make a request to "http://localhost:8888/tanisatoko", the  
>>> thing works perfectly. We can safely rule out the generation  
>>> process as the culprit.
>>> Now, the _odd_ thing I noticed is that in those cases, I get an  
>>> error of "PipelineNotFound", not a "ResourceNotFound", which means  
>>> that the matcher seriously doesn't see that request.
>>> Changing over the matcher to a 'regexp' matcher doesn't change, so,  
>>> I bet it's the data we feed to the matcher.
>>> Now, changing that matcher to  
>>> "&#xe8;&#xb0;&#xb7;&#xe7;&#x90;&#x86;&#xe5;&#xad;&#x90;"
>>> (the UTF-8 bytes written as ISO-8859-1 character references), and  
>>> running it again, I get my nice page correctly.
>>> I bet that somewhere (I don't know where, but surely somewhere), the  
>>> UTF-8 encoded URL is converted into a string using the current locale  
>>> (MacRoman on my system), or a default of "ISO-8859-1", before the  
>>> string is actually given to the sitemap.
>>> Not having the sources at hand at the moment, I can't do a quick  
>>> build to put out some debugging instruction, but  you get the idea.
>>>     Pier
>> -- 
>> Marc Portier                  
>> Outerthought - Open Source, Java & XML Competence Support Center
>> Read my weblog at      
