poi-dev mailing list archives

From Carey Sublette <carey.suble...@overture.com>
Subject Advice Requested for Adding Encoding Handling Enhancement to HSSF Package
Date Wed, 08 Jan 2003 20:57:06 GMT
Hi:

I am looking at making an enhancement to the HSSF package to allow control
of how data encodings are interpreted, and could use a bit of advice.

First a description of the problem that motivates my efforts:

I am running POI on a Unix server, and am reading XLS documents created by
Windows users with Excel. 

Now, Microsoft uses an 8-bit character encoding called code page 1252
(windows-1252), which is identical to Latin-1 (ISO-8859-1, whose values match
the first 256 Unicode code points) EXCEPT that Microsoft 'extended' the symbol
set by assigning additional symbols that are not in Latin-1 (but are
represented in the full Unicode symbol set) to the range of values from 0x80
to 0x9f, which Latin-1 reserves for control characters.
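
To make the difference concrete, here is a minimal sketch that decodes the same bytes both ways. The class name and the `decode` helper are made up for illustration, and it assumes the JVM ships the "windows-1252" charset (most do, but only a handful of charsets such as ISO-8859-1 and UTF-8 are guaranteed).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of POI.
class Cp1252VersusLatin1 {

    // Decode raw bytes with an explicitly chosen charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // 0x80 and 0x93 fall in the range that cp1252 reassigns; 0x41 is plain 'A'.
        byte[] raw = { (byte) 0x80, (byte) 0x93, (byte) 0x41 };

        // Assumption: this charset is available on the running JVM.
        Charset cp1252 = Charset.forName("windows-1252");

        String asCp1252 = decode(raw, cp1252);
        String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1);

        // cp1252 maps 0x80 to U+20AC (euro sign) and 0x93 to U+201C (left double quote).
        // Latin-1 maps them to the control code points U+0080 and U+0093.
        for (int i = 0; i < raw.length; i++) {
            System.out.printf("byte 0x%02X  cp1252=U+%04X  latin1=U+%04X%n",
                    raw[i] & 0xFF, (int) asCp1252.charAt(i), (int) asLatin1.charAt(i));
        }
    }
}
```

Only the bytes in 0x80 to 0x9f differ between the two decodings; everything else comes out the same.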

So the Excel-produced documents have these non-Unicode character values in
them. If POI is running on a Windows platform, there seems to be no problem:
the default encoding for that platform is apparently used, and the Unicode
equivalents of the 1252 symbols end up being returned. All is good.

But on Unix this doesn't happen. The default encoding there is (I think)
UTF-8, which knows nothing of these 1252-specific values. They don't get
mapped onto the Unicode equivalents, and all I get back are question
marks. Not good.
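
One way question marks can arise, sketched under the assumption that the bytes really are being decoded as UTF-8 (I have not confirmed that this is what POI does): a lone cp1252 byte such as 0x93 is malformed UTF-8, so Java substitutes U+FFFD, and re-encoding that for a terminal that cannot display it produces '?'. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of POI.
class QuestionMarkSketch {

    // Decode bytes as UTF-8; malformed sequences become U+FFFD.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Re-encode for an ASCII-only output; unmappable characters become '?'.
    static String asAsciiOutput(String s) {
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // 0x93 is a left double quote in cp1252 but not valid UTF-8 on its own.
        byte[] raw = { (byte) 0x93, (byte) 0x41 };

        String decoded = decodeAsUtf8(raw);
        System.out.printf("first char: U+%04X%n", (int) decoded.charAt(0));
        System.out.println("as ASCII output: " + asAsciiOutput(decoded));
    }
}
```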

I see the actual input document is read as bytes by POIFS, and then HSSF
does some processing to construct the string table, and eventually returns
the string. I'm trying to track down where the bytes are converted to
strings, so that I can devise a good scheme for providing control over this.
Instead of a one-shot fix-up for this particular problem, I'd like to provide
a general mechanism for explicitly specifying the encoding for reading and
maybe writing, similar to the InputStreamReader and OutputStreamWriter
classes.
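
For reference, this is the kind of explicit-encoding control I mean, shown with the plain JDK classes. It is only a sketch of the idea: `readAll` and the class name are hypothetical, nothing here is an existing HSSF API, and it assumes the JVM provides the "windows-1252" charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical demo class, not part of POI.
class ExplicitEncodingSketch {

    // Read a whole stream as text using the caller's charset
    // instead of the platform default.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // "caf" followed by 0x92, which cp1252 maps to U+2019 (right single quote).
        byte[] raw = { 0x63, 0x61, 0x66, (byte) 0x92 };

        // Assumption: this charset is available on the running JVM.
        Charset cp1252 = Charset.forName("windows-1252");

        String text = readAll(new ByteArrayInputStream(raw), cp1252);
        System.out.printf("last char: U+%04X%n", (int) text.charAt(text.length() - 1));
    }
}
```

Because the charset is passed in by the caller, the result is the same on Windows and Unix regardless of the platform default.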

Can someone tell me the class.method where I can find a hook into the byte
to string conversion?

Any general comments about my proposed enhancement are welcome.

P.S. I emailed Andrew about this back in the spring, but couldn't follow up
on it at the time. My company's email system has since purged the
correspondence, and I don't recollect what was said at the time.
