openoffice-dev mailing list archives

From Herbert Duerr <>
Subject Re: Improvements of OUString
Date Tue, 03 Dec 2013 16:27:55 GMT
On 03.12.2013 15:37, Andre Fischer wrote:
> On 03.12.2013 14:32, Herbert Duerr wrote:
>> On 03.12.2013 13:02, Andre Fischer wrote:
>>> On 03.12.2013 10:35, Herbert Duerr wrote:
>>>> On 03.12.2013 09:13, Andre Fischer wrote:
>>>> [...]
>>>> "The method isEmpty() returns true if the string is empty. If the
>>>> length of the string is one or two or three or any number bigger than
>>>> zero then isEmpty() returns false."
>>> Additionally to this almost correct statement one could mention that
>>> isEmpty() is preferred over getLength()>0 and why.
>> Yes, it is preferred for checking the emptiness because it directly
>> expresses what it checks.
>> In general it is a good idea to check for emptiness instead of
>> counting the elements and then comparing against zero. It's the old
>> "interface vs. implementation detail" question. The result will be the
>> same from a mathematical standpoint but the effort to get this result
>> may be different. From an algorithmic complexity standpoint an
>> emptiness check is always equal or better. Maybe a mathematician can
>> provide some insights from the set theory on this question?
>> By the way: the String class of Java>=6 got its isEmpty() method for
>> the same reasons.
> Can you add some of this to the documentation of isEmpty()? (maybe don't
> mention set theory)

Great idea. As the isEmpty() method from Java's String matches our new 
method maybe we should leverage their extensive reference [1] on this 
topic too. On the other hand that documentation was probably written for 
more experienced developers than the ones you'd like to attract.
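
A minimal sketch of the "emptiness check vs. counting" argument, assuming a hypothetical singly linked list (the names Node, isEmpty and getLength below are made up for illustration and are not the OUString implementation). Both checks give the same answer, but only one of them has to walk the data:

```cpp
#include <cstddef>

// Hypothetical singly linked list node, for illustration only.
struct Node {
    int value;
    Node* next;
};

// O(1): only looks at the head pointer.
inline bool isEmpty(const Node* head) {
    return head == nullptr;
}

// O(n): has to visit every element to count them.
inline std::size_t getLength(const Node* head) {
    std::size_t n = 0;
    for (const Node* p = head; p != nullptr; p = p->next) {
        ++n;
    }
    return n;
}
```

For a string class that stores its length, as I believe OUString does, both checks should cost about the same, so there the benefit is mainly that isEmpty() says directly what is being checked.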


>>> We should drop our support for ASCII?
>> UTF-8 contains ASCII. This was one of its most important design goals
>> and IMHO is a key factor that made this encoding such a big success.
>> [...]
> Hm, UTF-8 is not identical to ASCII.  What if I want to write an
> OUString to stdout?  Does a regular printf support UTF-8 or would I need
> a conversion from UTF-8 to ASCII for that?

If you have an ASCII string then you can directly print it in a UTF-8 
locale. No conversion needed. The inverse is also true: if a string 
containing only ASCII characters was encoded as UTF-8 then you can print 
it directly in an ASCII-compatible locale. No conversion needed for the 
output. The result would be exactly the same.

printf() and friends work with the encoding of the current locale's 
LC_CTYPE category, which is usually taken from the LC_ALL, LC_CTYPE or 
LANG environment variables. Nowadays this is very often UTF-8, which is 
backward compatible with ASCII.
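
To illustrate the backward compatibility, here is a small sketch of a UTF-8 encoder. The function names encodeUtf8 and isPureAscii are hypothetical helpers written for this mail, not part of sal/rtl, and the encoder assumes it is given a valid Unicode scalar value. Code points below 0x80 come out as the very same single byte they have in ASCII:

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8.
// Assumes cp is a valid scalar value (no surrogates, <= 0x10FFFF).
inline std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // ASCII range: one byte, identical to the ASCII encoding.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// True if every byte is below 0x80, i.e. the bytes are valid ASCII
// and at the same time valid UTF-8 with the same meaning.
inline bool isPureAscii(const std::string& s) {
    for (unsigned char c : s) {
        if (c >= 0x80) {
            return false;
        }
    }
    return true;
}
```

A string for which isPureAscii() holds can be handed to printf("%s", ...) unchanged in either an ASCII or a UTF-8 locale; only strings with bytes >= 0x80 depend on the locale's encoding.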

Some encodings are not ASCII compatible though, e.g. EBCDIC or DBCS 
(double-byte character sets). If you printed ASCII text in such 
environments without converting it first then you'd get gibberish. So 
if you want to make sure that what you want is what you get then you 
should always convert to the local encoding as determined by 

But ASCII and UTF-8 are quite dominant nowadays, especially on 
developer machines. While we could fix all debug-printing for 
environments that are not ASCII compatible, I suggest not to invest too 
much energy into such a task. The net gain would most probably be 
negative: supporting e.g. EBCDIC-based development environments would 
cost us more developer time than the number of developers we'd win.

>>>>> [...]
>>>>>      ::rtl::OUStringToOString(sOUStringVariable,
>>>> This awful construct could be made much simpler if our strings were
>>>> always unicode (UTF-8/UTF-16/UTF-32).
>>> I thought that OUString is UTF-16 and that that was the cause, not the
>>> solution of the conversion problems.
>> The complexity of the awful construct comes from the use of the
>> general purpose machinery for an N:1 conversion (with N being the
>> number of supported byte encodings). A 1:1 conversion (UTF-8 <->
>> UTF-16) is much simpler.
> I think you are mixing up two concepts here.   One is the ability to
> convert an OUString to/from all text encodings defined
> sal/in/rtl/textenc.h.  The other is a possible replacement of the
> OUString implementation of UTF-16 with UTF-8.

IMHO O*Strings should only support Unicode, be it UTF-8 or UTF-16. 
Mapping between these two variants would be a 1:1 thing.

These O*Strings are the strings used all over the office. The more than 
94 other encodings should only be supported in the few corners of the 
code that have to deal with non-Unicode strings: for converting from/to 
our then Unicode-only O*Strings.
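
To show how small such a 1:1 mapping can be, here is a rough sketch of a UTF-16 to UTF-8 conversion using only the standard library. The function name utf16ToUtf8 is hypothetical and this is not how OUStringToOString is implemented. Replacing unpaired surrogates with U+FFFD is my assumption about error handling; other policies are possible.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Convert UTF-16 code units to UTF-8 bytes.
// Unpaired surrogates are replaced by U+FFFD (an assumed policy).
inline std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        const bool isHigh = cp >= 0xD800 && cp <= 0xDBFF;
        const bool isLow = cp >= 0xDC00 && cp <= 0xDFFF;
        if (isHigh && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (isHigh || isLow) {
            cp = 0xFFFD;  // unpaired surrogate
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

No conversion tables and no encoding parameter are needed, which is the point of restricting the everyday string classes to Unicode.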

