openoffice-dev mailing list archives

From "Dennis E. Hamilton" <dennis.hamil...@acm.org>
Subject RE: Need help with strings
Date Thu, 09 Apr 2015 17:00:36 GMT
Hi!

You are digging into my favorite subject.

I am assuming you are talking about strings within the MathML and that it is in some form
of XML. In that case:

If it is XML, the encoding can be specified in the <?xml ... ?> declaration at the start
of the XML prologue.  Sniffing the first bytes for this declaration will determine such things
as whether the text is UTF8 or UTF16, and big-endian or little-endian.  If single-byte, that
will usually mean some kind of code page that has ASCII as a common subset of a larger
encoding, such as Western European.  In that case, one can read the content of the declaration
to see what it says, because it should be in a simple, pure ASCII form.  Even if it is a
double-byte character encoding, such as Shift-JIS, the declaration only needs the single-byte
portions that are the same as ASCII.
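
The sniffing described above could be sketched in C++ roughly as follows.  This is only
an illustration, not AOO code: the names XmlEncodingGuess, sniffXmlEncoding and
declaredEncoding are made up for the example, UTF-32 and EBCDIC are ignored, and the
caller is assumed to have stripped any UTF8 byte-order mark before asking for the
declared encoding.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Encoding family guessed from the first bytes of an XML stream.
// (Hypothetical type for this sketch; not part of any AOO API.)
enum class XmlEncodingGuess { Utf8, Utf16LE, Utf16BE, AsciiCompatible };

// Look for a byte-order mark, or for the byte pattern that "<?" produces
// in UTF-16 when no mark is present.
XmlEncodingGuess sniffXmlEncoding(const std::vector<unsigned char>& b)
{
    const std::size_t n = b.size();
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return XmlEncodingGuess::Utf8;       // UTF-8 signature
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return XmlEncodingGuess::Utf16LE;    // BOM, little-endian
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return XmlEncodingGuess::Utf16BE;    // BOM, big-endian
    if (n >= 4 && b[0] == 0x3C && b[1] == 0x00 && b[2] == 0x3F && b[3] == 0x00)
        return XmlEncodingGuess::Utf16LE;    // "<?" without BOM
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x3C && b[2] == 0x00 && b[3] == 0x3F)
        return XmlEncodingGuess::Utf16BE;    // "<?" without BOM
    // Anything else: read the declaration as ASCII and see what it says.
    return XmlEncodingGuess::AsciiCompatible;
}

// Return the value of encoding="..." from an ASCII-readable XML declaration,
// or an empty string if there is no declaration or no encoding attribute.
std::string declaredEncoding(const std::string& head)
{
    if (head.compare(0, 5, "<?xml") != 0)
        return "";
    const std::size_t end = head.find("?>");
    if (end == std::string::npos)
        return "";
    const std::string decl = head.substr(0, end);
    std::size_t pos = decl.find("encoding");
    if (pos == std::string::npos)
        return "";
    pos = decl.find_first_of("\"'", pos);    // opening quote of the value
    if (pos == std::string::npos)
        return "";
    const char quote = decl[pos];
    const std::size_t close = decl.find(quote, pos + 1);
    if (close == std::string::npos)
        return "";
    return decl.substr(pos + 1, close - pos - 1);
}
```

A caller would first run sniffXmlEncoding on the raw bytes, and only in the
AsciiCompatible (or UTF8) case hand the start of the file to declaredEncoding.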

The default, however, depends on the MIME type of the XML file.  Text/xml and application/xml
have different defaults.  Also, MIME types can have parameters that specify character sets.
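
When the MIME type is available (for example from an HTTP header), the charset parameter
could be pulled out with something like the sketch below.  The function name
charsetFromContentType is invented for the example, and the parsing is deliberately
simple: it assumes the form "charset=value" with no spaces around the equals sign.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Extract the charset parameter from a Content-Type value such as
// "application/xml; charset=utf-16".  Returns "" when none is given.
// (Hypothetical helper for this sketch; simplified parameter parsing.)
std::string charsetFromContentType(const std::string& contentType)
{
    // Parameter names are case-insensitive, so search a lower-cased copy.
    std::string lower(contentType);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string key = "charset=";
    std::size_t pos = lower.find(key);
    if (pos == std::string::npos)
        return "";
    pos += key.size();

    // The value runs to the next ';' or to the end of the string.
    std::size_t end = lower.find(';', pos);
    if (end == std::string::npos)
        end = lower.size();
    const std::string value = contentType.substr(pos, end - pos);

    // Strip surrounding whitespace and optional double quotes.
    const char* trim = " \t\"";
    const std::size_t first = value.find_first_not_of(trim);
    if (first == std::string::npos)
        return "";
    const std::size_t last = value.find_last_not_of(trim);
    return value.substr(first, last - first + 1);
}
```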

The way Windows manages this also includes putting a Unicode byte-order mark (BOM) at the
start of UTF8 files.  (UTF8 itself has no byte order; there the mark is just a signature.)
 These marks are not uniformly used across platforms.

Internally, because ODF and AOO are Unicode based, it is necessary to translate all arriving
text into Unicode for internal storage and use by the application.  To do otherwise is
madness.  There are difficulties with this, because Unicode allows local specializations.
This comes up in craziness around Symbol fonts that do not have common Unicode correspondence.
 (Bullets in AOO have this disease.)
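
As a rough sketch of translating arriving bytes into one internal Unicode form, the
following uses the standard C++11 std::u16string as a stand-in for whatever string class
the application actually uses; it already offers find, rfind, insert, substr, clear and
erase.  The names decodeUtf16 and containsMathStartTag are invented for the example, and
the "<math" test is a simplistic check that would miss a namespace-prefixed element.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode raw UTF-16 bytes into UTF-16 code units, dropping a leading BOM.
// A trailing odd byte, if any, is ignored.
// (Hypothetical helper for this sketch; not an AOO function.)
std::u16string decodeUtf16(const std::vector<unsigned char>& bytes, bool littleEndian)
{
    std::u16string out;
    out.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned lo = littleEndian ? bytes[i] : bytes[i + 1];
        const unsigned hi = littleEndian ? bytes[i + 1] : bytes[i];
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    if (!out.empty() && out[0] == u'\uFEFF')
        out.erase(0, 1);                     // remove the byte-order mark
    return out;
}

// Simplistic check: does the text contain an unprefixed <math start tag?
bool containsMathStartTag(const std::u16string& text)
{
    return text.find(u"<math") != std::u16string::npos;
}
```

Once the text is in this form, the same search code works no matter which encoding the
file arrived in.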

I have probably provided more information than you require.  I love this subject.  

I have not looked at your code.

 - Dennis

PS: The default representation of XML inside OOXML is UTF16 as I recall.  I could be mistaken.

-----Original Message-----
From: Regina Henschel [mailto:rb.henschel@t-online.de] 
Sent: Wednesday, April 8, 2015 12:02
To: AOO dev
Subject: Need help with strings

Hi all,

I'm going to improve the MathML type detection. Currently there are 
files that could be opened or imported fine if the type detection 
allowed it. See https://bz.apache.org/ooo/show_bug.cgi?id=126230

I have attached a C++ file to show what I want to do.
The problem is that MathML does not need to be encoded in utf-8 but can 
have any other encoding. For example, the MS Windows "Math Input Control" 
exports formulas in utf-16.

So my question is: which kind of string can I use that is able to 
detect/use utf-16 and has methods similar to the C++ string 
methods find, rfind, insert, substring, clear, erase? Does AOO have such 
a kind of string?

It is possible to get the encoding from the MathML file, or to set a default 
of utf-8, in case that information is needed to instantiate a string 
object.

Kind regards
Regina






