poi-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 38230] New: - [PATCH] UnicodeString#fillFields invalid read of non US characters >=128 and <=255
Date Wed, 11 Jan 2006 22:28:56 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=38230>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=38230

           Summary: [PATCH] UnicodeString#fillFields invalid read of non US
                    characters >=128 and <=255
           Product: POI
           Version: 3.0-dev
          Platform: All
        OS/Version: other
            Status: NEW
          Keywords: PatchAvailable
          Severity: major
          Priority: P2
         Component: HSSF
        AssignedTo: poi-dev@jakarta.apache.org
        ReportedBy: per.sil@gmx.it
                CC: c.gosch@inovex.de


I had a problem reading HSSFCell values that contain German-specific letters
(umlauts). Most probably the same difficulty applies to all characters with
integer values from 128 to 255.

They all ended up with a high byte in which all bits are set to 1. This turned
out to be a type cast problem, observed on J2SE 1.4.2(06).

Casting from byte to char sign-extends the value: byte is a signed type, so the
cast first widens the byte to int, copying its highest bit into all upper bits,
and then narrows the result to char. The German umlaut ä uses 0xe4, or 11100100
in binary. Converting this value to char results in 1111111111100100 (0xffe4).

See this small test program:
----------------------
public class ByteConverterTest {
    public static void main(String[] args) {
        byte umlautChar = (byte)0xe4;  // the German umlaut ä
        char badEncoded = (char)umlautChar;  // sign-extended to 0xffe4
        char goodEncoded = (char)(umlautChar & 0xff);  // mask keeps only the low byte

        System.out.println("Badly converted umlaut uses hex value: " +
                Integer.toHexString(badEncoded));
        System.out.println("Good converted umlaut uses hex value: " +
                Integer.toHexString(goodEncoded) + "\n");
    }
}
----------------------

Output is:
----------------------
Badly converted umlaut uses hex value: ffe4
Good converted umlaut uses hex value: e4
----------------------
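
The claim that every value from 128 to 255 is affected can be checked
exhaustively. The following is a minimal sketch. The class name
SignExtensionCheck and the helpers toUnsignedChar and countMismatches are
hypothetical names chosen for illustration and are not part of POI.

```java
class SignExtensionCheck {
    // Hypothetical helper: convert a byte to a char without sign extension.
    static char toUnsignedChar(byte b) {
        return (char) (b & 0xff);
    }

    // Counts the byte values for which a plain cast and a masked cast disagree.
    static int countMismatches() {
        int mismatches = 0;
        for (int value = 0; value <= 0xff; value++) {
            byte b = (byte) value;
            char naive = (char) b;            // sign-extends when value >= 0x80
            char masked = toUnsignedChar(b);  // always in the range 0x00..0xff
            if (naive != masked) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // Every value from 128 to 255 differs, so this prints 128.
        System.out.println("Mismatching byte values: " + countMismatches());
    }
}
```

Values 0 to 127 have a cleared sign bit, so the plain cast and the masked
cast agree for them. Only the upper half of the byte range is affected.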

Attached you will find a patch that resolves this issue in the class
UnicodeString. The method fillFields uses this kind of improper type cast.
Perhaps other classes do as well.
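
To illustrate the kind of change involved, here is a sketch of decoding a run
of 8-bit string data with the mask applied. This is not the attached patch and
not the actual code of fillFields. The class CompressedStringSketch and the
method decodeCompressed are hypothetical, and the sketch assumes each byte
holds one character in the range 0 to 255.

```java
class CompressedStringSketch {
    // Hypothetical helper: decode 'length' bytes starting at 'offset',
    // treating every byte as an unsigned character code (0..255).
    static String decodeCompressed(byte[] data, int offset, int length) {
        char[] chars = new char[length];
        for (int i = 0; i < length; i++) {
            // The mask discards the sign-extended upper bits.
            chars[i] = (char) (data[offset + i] & 0xff);
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // "M\u00e4rz" (March in German) as single bytes; 0xe4 is the umlaut.
        byte[] raw = { 'M', (byte) 0xe4, 'r', 'z' };
        String decoded = decodeCompressed(raw, 0, raw.length);
        for (int i = 0; i < decoded.length(); i++) {
            System.out.println(Integer.toHexString(decoded.charAt(i)));
        }
    }
}
```

The program prints hex codes instead of the characters themselves so that the
result does not depend on the console encoding.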

Reproducible: Always (see test code)
Platform: Windows 2k, Linux 2.6.x
JVM: J2SE 1.4.2(06) and J2SE 1.4.2(10)


For those who are experiencing the same problem but do not want to wait for this
patch to make its way into CVS, you can use the following code to convert your
cell value to a proper Java string:
----------------------
String cellValue = cell.getRichStringCellValue().getString();
// clean invalid type casts
if (cellValue != null) {
    char[] buffer = cellValue.toCharArray();
    StringBuffer newValue = new StringBuffer(buffer.length);
    for (int i = 0; i < buffer.length; i++) {
        char charValue = buffer[i];

        // strip the high byte if all of its bits are set to 1
        // (note: this also alters genuine characters in U+FF00..U+FFFF)
        if ((charValue & 0xff00) == 0xff00)
            charValue = (char)(charValue & 0xff);

        newValue.append(charValue);
    }

    cellValue = newValue.toString();
}

----------------------


I have tried to find a previously entered bug report on this subject but failed.
I am sorry if I have missed it.

