tcl-websh-dev mailing list archives

From Ronnie Brunner <ron...@netcetera.ch>
Subject Re: uricode.c problem
Date Fri, 28 Dec 2001 15:32:11 GMT
> > Compiling generic/uricode.c (version 1.3) I get a couple of
> > warnings: my compiler does not like line 141 (function uriDecode):
> 
> > 	 signed char *utf = NULL;
> 
> > The warning makes sense, since all the functions using utf need a
> > char* and not a signed char*.
> 
> You mean an unsigned char*?  Just using a 'char *' is dangerous,
> because it changes.  On my PPC box, for example, by default it's
> unsigned (and so the comparisons were not really comparing anything).
> 
> Hrm... the Tcl man page says:
> 
>         A Tcl_UniChar is a Unicode character represented as an
>         unsigned, fixed-size quantity.  A UTF-8 character is a
>         Unicode character represented as a varying-length sequence of
>         up to TCL_UTF_MAX bytes.  A multibyte UTF-8 sequence consists
>         of a lead byte followed by some number of trail bytes.
> 
> So I guess unsigned is correct, but then they should spell that out in
> the man page?  I'm a bit confused.
> 
> What does your compiler say if you use an 'unsigned char*'?

It doesn't like it. It really wants a char* only.

> > However, fixing this has to break the code, since there are some if
> > and else clauses on (utf > 0).  Couldn't these checks be changed to
> > (utf > 127)?
> 
> Sure, no problem.  But then it has to be spelled out that it is an
> unsigned char, as by default, i386 uses signed chars.
> 
> > BTW: unfortunately the test suite still works perfectly even if the
> > code is broken -> there should be some additional tests on
> > web::uridecode...
> 
> Maybe we should try some trickery with the comparison, so that we can
> just use the default 'char *' type...

I just looked at the Tcl source to see how they deal with this: from
generic/tclUtf.c:1.16 lines 293ff 

int
Tcl_UtfToUniChar(str, chPtr)
    register CONST char *str;	 /* The UTF-8 string. */
    register Tcl_UniChar *chPtr; /* Filled with the Tcl_UniChar represented
			          * by the UTF-8 string. */
{
    register int byte;
    
    /*
     * Unroll 1 to 3 byte UTF-8 sequences, use loop to handle longer ones.
     */

    byte = *((unsigned char *) str);
    if (byte < 0xC0) {
       /* single byte char ... */
    } else if (byte < 0xE0) {
       /* two byte characters ... */
    } else if (byte < 0xF0) {
       /* three byte characters ... */
    } else {
       /* only if TCL_UTF_MAX > 3 ... */
    }
}
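
The cast in the excerpt above is what makes the comparisons independent of
whether plain char is signed or unsigned on a given platform: the pointer
stays a 'char *' (so the compiler is happy), and each byte is read through
an 'unsigned char *' into an int before it is compared. A minimal sketch of
that idea, assuming a (utf > 127) style check is what uriDecode needs. The
helper names is_non_ascii and utf8_lead_class are hypothetical and are not
part of websh or Tcl; only the thresholds 0xC0, 0xE0 and 0xF0 are taken
from the excerpt.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from websh or Tcl): returns 1 if the byte at p
 * lies outside the 7-bit ASCII range.  Reading through an unsigned char
 * pointer yields 0..255 whether plain char is signed (i386 default) or
 * unsigned (PPC default), so the comparison means the same everywhere. */
static int is_non_ascii(const char *p)
{
    int byte;

    assert(p != NULL);
    byte = *((const unsigned char *) p);
    return byte > 127;
}

/* Hypothetical helper: reports which branch of the Tcl_UtfToUniChar
 * excerpt a byte would take (1 = single byte branch, 2 = two byte,
 * 3 = three byte, 4 = the final "longer sequence" branch). */
static int utf8_lead_class(const char *p)
{
    int byte;

    assert(p != NULL);
    byte = *((const unsigned char *) p);
    if (byte < 0xC0) {
        return 1;
    } else if (byte < 0xE0) {
        return 2;
    } else if (byte < 0xF0) {
        return 3;
    }
    return 4;
}
```

With this pattern the variable in uriDecode could stay a plain 'char *',
and a check such as is_non_ascii(utf) would replace the sign-dependent
comparison. Whether that fits the rest of uricode.c is an assumption here,
since only line 141 of that file is quoted in this thread.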


------------------------------------------------------------------------
Ronnie Brunner                               ronnie.brunner@netcetera.ch
Netcetera AG, 8040 Zuerich    phone +41 1 247 79 79 Fax: +41 1 247 70 75
