commons-dev mailing list archives

From Leandro Reis <lr...@adobe.com>
Subject Re: [io] support for additional character sets needed in ReversedLinesFileReader
Date Fri, 06 Mar 2015 23:20:59 GMT
Hi sebb and all,

Here’s a revised proposed addition to ReversedLinesFileReader to support
the CCJK Windows code pages:

...
} else if(charset == Charset.forName("Shift_JIS") ||      // Same as for UTF-8
          // http://www.herongyang.com/Unicode/JIS-Shift-JIS-Encoding.html
          charset == Charset.forName("windows-31j") ||    // Windows code page 932 (Japanese)
          charset == Charset.forName("x-windows-949") ||  // Windows code page 949 (Korean)
          charset == Charset.forName("gbk") ||            // Windows code page 936 (Simplified Chinese)
          charset == Charset.forName("x-windows-950")) {  // Windows code page 950 (Traditional Chinese)
    byteDecrement = 1;
}
…
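
For illustration only, here is a minimal sketch of how the byteDecrement selection might look as a standalone helper. The class and method names (ByteDecrementSketch, byteDecrementFor) are hypothetical and not part of Commons IO, and the surrounding branches (single-byte check, UTF-8, UTF-16BE/LE) are an assumption about the shape of the existing constructor. It compares with equals() rather than ==, and skips charsets that the running JRE does not provide.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Commons IO.
class ByteDecrementSketch {

    // Charset names taken from the proposal: multi-byte encodings in which
    // a newline byte is assumed never to occur inside a multi-byte character.
    private static final String[] NEWLINE_SAFE_MULTIBYTE = {
        "Shift_JIS", "windows-31j", "x-windows-949", "gbk", "x-windows-950"
    };

    static int byteDecrementFor(Charset charset) throws UnsupportedEncodingException {
        if (charset.newEncoder().maxBytesPerChar() == 1f) {
            return 1; // single-byte encodings
        }
        if (charset.equals(StandardCharsets.UTF_8)) {
            return 1;
        }
        for (String name : NEWLINE_SAFE_MULTIBYTE) {
            // Extended charsets may be missing from a trimmed-down runtime.
            if (Charset.isSupported(name) && charset.equals(Charset.forName(name))) {
                return 1;
            }
        }
        if (charset.equals(StandardCharsets.UTF_16BE)
                || charset.equals(StandardCharsets.UTF_16LE)) {
            return 2;
        }
        throw new UnsupportedEncodingException(
                "Encoding " + charset + " is not supported yet");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(byteDecrementFor(StandardCharsets.UTF_8));
        System.out.println(byteDecrementFor(StandardCharsets.UTF_16LE));
    }
}
```

Looking the names up through Charset.isSupported() first avoids an UnsupportedCharsetException on runtimes that ship without the extended charsets.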

A newline byte never appears as part of a multi-byte character in any of
these encodings.
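
The claim can also be checked empirically instead of by reading the mapping tables. The sketch below (the class name NewlineByteCheck is hypothetical) encodes every BMP character a charset can encode and reports whether 0x0A or 0x0D shows up inside any multi-byte sequence. It assumes the JRE's encoder matches the published code page tables and covers only BMP characters; charsets missing from the runtime are skipped.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical verification helper; not part of Commons IO.
class NewlineByteCheck {

    // Returns true if LF (0x0A) or CR (0x0D) occurs as one of the bytes of
    // any multi-byte encoded BMP character in the given charset.
    static boolean newlineInMultiByte(Charset cs) {
        CharsetEncoder enc = cs.newEncoder();
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            char c = (char) cp;
            if (Character.isSurrogate(c) || !enc.canEncode(c)) {
                continue; // skip lone surrogates and unmappable characters
            }
            byte[] bytes = String.valueOf(c).getBytes(cs);
            if (bytes.length < 2) {
                continue; // single-byte characters are not the concern here
            }
            for (byte b : bytes) {
                if (b == 0x0A || b == 0x0D) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] names = {
            "Shift_JIS", "windows-31j", "x-windows-949", "gbk", "x-windows-950"
        };
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available in this runtime");
                continue;
            }
            System.out.println(name + ": newline byte inside multi-byte char = "
                    + newlineInMultiByte(Charset.forName(name)));
        }
    }
}
```

As a sanity check of the method itself, UTF-16BE should report true (U+000A encodes as 0x00 0x0A), which is why that encoding needs a byteDecrement of 2 rather than 1.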

Thanks and regards,
Leandro


On 3/2/15, 4:02 PM, "Leandro Reis" <lreis@adobe.com> wrote:

>On 2 March 2015 at 21:53, sebb wrote:
>
>>>On 2 March 2015 at 20:00, Leandro Reis <lreis@adobe.com> wrote:
>>>Hi all,
>>>
>>>I'm working on a product that uses Commons IO via Jackrabbit Oak. While
>>>testing the launch of this product on Japanese Windows Server 2012 R2,
>>>I came across the following exception:
>>>"(java.io.UnsupportedEncodingException: Encoding windows-31j is not
>>>supported yet (feel free to submit a patch))"
>>>
>>>windows-31j is the IANA name for Windows code page 932 (Japanese), and is
>>>returned by Charset.defaultCharset(), used in
>>>org.apache.commons.io.input.ReversedLinesFileReader [0].
>>>
>>>
>>>It looks like this issue could be addressed by adding a check for
>>>"windows-31j" to ReversedLinesFileReader(final File file, final int
>>>blockSize, final Charset encoding):
>>>
>>>
>>>...
>>>} else if(charset.equals(Charset.forName("windows-31j"))) {
>>>     byteDecrement = 1;
>>>}
>>>...
>>>
>>>Similar changes would be needed in order to support the Chinese
>>>Simplified, Chinese Traditional, and Korean versions of the same OS (I'm
>>>checking what the corresponding encoding names are).
>>>
>>>Can someone familiar with this area of the code confirm this looks like
>>>the proper approach to addressing this?
>
>>Can a newline byte ever appear as part of a multi-byte character in any
>>of those encodings?
>No. Sources:
>- Japanese: http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
>- Simplified Chinese: http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT
>- Korean: http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP949.TXT
>- Traditional Chinese: http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
>
>
>>>Thanks,
>>> Leandro
>>>
>>>[0] http://svn.apache.org/viewvc/commons/proper/io/trunk/src/main/java/org/apache/commons/io/input/ReversedLinesFileReader.java?view=markup
>
>
>
