commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Herbert <alex.d.herb...@gmail.com>
Subject Re: [Text, Lang] Matching two CharSequence instances
Date Sun, 03 Mar 2019 07:58:10 GMT


> On 3 Mar 2019, at 02:00, Bruno P. Kinoshita <kinow@apache.org> wrote:
> 
> Hi Alex,
> Also found the two implementations similar, but confusing.
> - The parameter name inconsistency
> +1
> 
> 
> - The edge case logic inconsistency
> 
> +1
> 
> - Change to a stepwise charAt comparison
> +1 if the behaviour is kept. Right now some crazy characters like those old latin letters
are supported (e.g. LATIN CAPITAL LIGATURE IJ), and the result for equal and equal ignore
case are as expected.
> And right now those methods do not work with surrogate characters (e.g. GEORGIAN LETTER
AN and GEORGIAN CAPITAL LETTER AN are false for equals and equals ignore case). But this is
an edge case that we can ignore for now I believe.

I’ve tested it locally and all units tests are fine. So I’ll PR when I have time.

> CheersBruno
> 
>    On Sunday, 3 March 2019, 1:29:12 pm NZDT, Alex Herbert <alex.d.herbert@gmail.com>
wrote:  
> 
> Having looked a bit more at StringUtils it appears that:
> 
> public static boolean equals(final CharSequence cs1, final CharSequence cs2);
> public static boolean equalsIgnoreCase(final CharSequence str1, final CharSequence str2);
> 
> share edge case logic checking but it is implemented differently (although with the same
effect). They also have inconsistent parameter names (cs1 vs str1).
> 
> The equals method then ends up calling CharSequenceUtils.regionMatches but without the
ignore case option. This method does a lot more work than necessary as it performs a lot of
edge case checks then loops with regard to checking for case. It would be cleaner to mimic
String.equals by using a stepwise charAt comparison.
> 
> Any objections to a PR to fix:
> 
> - The parameter name inconsistency
> - The edge case logic inconsistency
> - Change to a stepwise charAt comparison
> 
> Alex
> 
>> On 2 Mar 2019, at 20:49, Alex Herbert <alex.d.herbert@gmail.com> wrote:
>> 
>> 
>>> On 2 Mar 2019, at 16:59, Mark Dacek <mark@syberion.com <mailto:mark@syberion.com>>
wrote:
>>> 
>>> Is your proposed method a stepwise charAt comparison across both, assuming
>>> non-null and equal length?
>> 
>> Yes. Although the StringUtils.equals(CharSequence, CharSequence) from [lang] will
do the job correctly (thanks Gary). It currently does all the edge case checks then calls
a region matching method using the entire length but the effect is the same as:
>> 
>> for (int i = 0; i < cs1.length(); i++) {
>>     if (cs1.charAt(i) != cs2.charAt(i)) {
>>         return false;
>>     }
>> }
>> return true;
>> 
>> Switching in the above code instead of the call to regionMatches(…) at the end
of StringUtils.equals(CharSequence, CharSequence) would avoid repeating all the edge case
checks of length at the start of that method and the case insensitivity functionality. 
>> 
>> The StringUtils.equals method already detects if String is input as both arguments
and defaults to that if possible. So this is basically for any other combination of CharSequence
types where a simple stepwise charAt comparison is wanted.
>> 
>> 
>>> Doesn't seem like a bad idea, though I'm curious whether there's a use-case
>>> where toString() on both and comparing isn't more expedient.
>> 
>> Just the memory overhead of duplicating to create a String. If a match is unlikely,
especially near the start, then this is a cost to consider for longer strings.
>> 
>> I was just after something to put in place of the incorrect usage of:
>> 
>> CharSequence cs1, cs2;
>> cs1.equals(cs2);
>> 
>> Which is not part of the CharSequence interface and works only if inputting 2 objects
that support equals correctly, like String or StringBuilder.
>> 
>> I’ve just has a look for .equals() in all of [text] and this is actually a bug
that is in the newly submitted JaroWinklerSimilarity too. 
>> 
>> I’ll do a PR to fix that one.
>> 
>>> 
>>> On Sat, Mar 2, 2019 at 11:53 AM Alex Herbert <alex.d.herbert@gmail.com <mailto:alex.d.herbert@gmail.com>>
>>> wrote:
>>> 
>>>> I am helping with the PR for TEXT-126 to add to the similarity package.
>>>> 
>>>> Part of the new algorithm requires identifying if two CharSequences are
>>>> identical. Is there a utility in Text to do something like this:
>>>> 
>>>> public static boolean CharSequenceUtils.equals(CharSequence, CharSequence);
>>>> 
>>>> I cannot find one with a quick regex search of the library. I am not
>>>> familiar with Lang either but this is a dependency so a method from there
>>>> could be used.
>>>> 
>>>> The current PR is using left.equals(right) on the input CharSequence to
>>>> compare to one to another which is wrong if the two input CharSequences do
>>>> not support matching, e.g. if the input was a String and StringBuilder then
>>>> String.equals(StringBuilder) would not match, even if the characters were
>>>> the same.
>>>> 
>>>> Regards,
>>>> 
>>>> Alex
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org <mailto:dev-unsubscribe@commons.apache.org>
>>>> For additional commands, e-mail: dev-help@commons.apache.org <mailto:dev-help@commons.apache.org>
>>>> 
>>>> 
>> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message