commons-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LANG-1406) StringIndexOutOfBoundsException in StringUtils.replaceIgnoreCase
Date Thu, 09 Aug 2018 08:46:00 GMT

    [ https://issues.apache.org/jira/browse/LANG-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16574490#comment-16574490 ]

ASF GitHub Bot commented on LANG-1406:
--------------------------------------

Github user kinow commented on the issue:

    https://github.com/apache/commons-lang/pull/340
  
    Oh, that does make sense now. So the first visible character we see is the ["Latin Capital
Letter I with Dot Above"](https://unicode-table.com/en/#0130) (see also [this other link](https://en.wikipedia.org/wiki/Dotted_and_dotless_I)),
and the second is an `x`. Calling `toUpperCase()` won't change it, as it's already considered
upper case.
    
    When doing a `toLowerCase()`, it gets translated into two visible characters. The second
is the normal `x`, while the first consists of two code points. I tested in Python and got the
lower case `i` (`print(u"\u0069")`) followed by a character that is invisible by itself (`print(u"\u0307")`, the combining dot above).
    
    The special/invisible character becomes visible when it comes after certain letters.
    
    ```python
    >>> print(u"\u0307")
    
    >>> print(u"\u0069\u0307")
    i̇
    >>> print(u"\u0068\u0307")
    ḣ
    >>> print(u"\u0067\u0307")
    ġ
    ```
    
    When we get these invisible characters, we have one more code point, so the length of the
lower-cased string is not 2 but 3, which results in the exception in this issue.
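
    To illustrate the mismatch, here is a minimal Python sketch in the same spirit as the snippet above. It only mimics the index arithmetic (search in the lower-cased copy, then reuse the index on the original) and is not the Commons Lang code. Python counts code points where Java counts UTF-16 units, but both characters here are in the BMP, so the lengths agree.

```python
text = "\u0130x"          # the string from the bug report
lowered = text.lower()    # "i\u0307x": the dotted I becomes two code points

print(len(text), len(lowered))

# Search in the lower-cased copy, then reuse that index on the original.
index = lowered.find("x")
try:
    text[index]
    overshoot = False
except IndexError:
    # The index found in the longer string does not exist in the original.
    overshoot = True
print(index, overshoot)
```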
    
    I don't believe the fix here would cover the reverse case, where a lower case, single
code point character is represented by two code points once converted to upper case. The exception could
happen again (I haven't investigated whether such a case exists, but I'm assuming there could
be one; if not now, a character could still be added in future editions).
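
    A quick check in Python suggests such a case does exist, at least with Python's case mapping: the sharp s (`\u00df`) is a single code point whose upper-case form is two. This is only a sketch, and whether it actually triggers the exception in `replaceIgnoreCase` would still need to be tested in Java.

```python
lower = "\u00df"        # LATIN SMALL LETTER SHARP S
upper = lower.upper()   # full case mapping expands it

print(len(lower), len(upper))
print(upper)
```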
    
    What do you think @HiuKwok? Any suggestions? I'm not sure if there's any easy way to
fix this case, except by adding a note to the documentation saying that the method is not
intended to be used with Unicode strings, as it doesn't handle supplementary characters well.
Or maybe we could try to remove the `length()` call around the `StringBuilder`s near the
end of the method...
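
    One alternative direction, sketched in Python purely for discussion: compare a lower-cased window of the original text against the lower-cased search string, so that every index always refers to the original. The helper name `replace_ignore_case` is hypothetical, and this is neither the change proposed in the pull request nor the Commons Lang implementation.

```python
def replace_ignore_case(text, search, replacement):
    """Hypothetical helper: replace every case-insensitive match of search."""
    if not text or not search:
        return text
    needle = search.lower()
    width = len(search)
    out = []
    i = 0
    while i < len(text):
        # Lower-case only the candidate window, never the whole text,
        # so i is always a valid index into the original string.
        if text[i:i + width].lower() == needle:
            out.append(replacement)
            i += width
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(repr(replace_ignore_case("\u0130x", "x", "")))
```

    The sketch has a limitation of its own: it will not match when the matching region of the text and the search string differ in length, for example when searching for `i\u0307` in a text that contains `\u0130`.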


> StringIndexOutOfBoundsException in StringUtils.replaceIgnoreCase
> ----------------------------------------------------------------
>
>                 Key: LANG-1406
>                 URL: https://issues.apache.org/jira/browse/LANG-1406
>             Project: Commons Lang
>          Issue Type: Bug
>          Components: lang.*
>            Reporter: Michael Ryan
>            Priority: Major
>
> STEPS TO REPRODUCE:
> {code}
> StringUtils.replaceIgnoreCase("\u0130x", "x", "")
> {code}
> EXPECTED: "\u0130" is returned.
> ACTUAL: StringIndexOutOfBoundsException
> This happens because the replace method is assuming that text.length() == text.toLowerCase().length(),
> which is not true for certain characters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
