commons-dev mailing list archives

From Alex Herbert <alex.d.herb...@gmail.com>
Subject Re: [Text, Lang] Matching two CharSequence instances
Date Sat, 02 Mar 2019 20:49:06 GMT

> On 2 Mar 2019, at 16:59, Mark Dacek <mark@syberion.com> wrote:
> 
> Is your proposed method a stepwise charAt comparison across both, assuming
> non-null and equal length?

Yes, although StringUtils.equals(CharSequence, CharSequence) from [lang] will do the job
correctly (thanks Gary). It currently does all the edge case checks, then calls a region matching
method over the entire length, but the effect is the same as:

for (int i = 0; i < cs1.length(); i++) {
    if (cs1.charAt(i) != cs2.charAt(i)) {
        return false;
    }
}
return true;

Substituting the loop above for the call to regionMatches(…) at the end of
StringUtils.equals(CharSequence, CharSequence) would avoid repeating the length edge case
checks at the start of regionMatches, and would skip its case insensitivity handling.

The StringUtils.equals method already detects when both arguments are Strings and delegates
to String.equals in that case. So this is basically for any other combination of CharSequence
types where a simple stepwise charAt comparison is wanted.
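
To make the idea concrete, here is a minimal self-contained sketch of such a comparison with the null and length checks done once up front. The class name CharSequenceEquality and the method name contentEquals are hypothetical, chosen only for illustration; this is not the [lang] or [text] implementation.

```java
/** Hypothetical helper for illustration; not part of [lang] or [text]. */
public final class CharSequenceEquality {

    private CharSequenceEquality() {
        // static utility, no instances
    }

    /**
     * Compares two CharSequences character by character.
     * Two null references are treated as equal, one null is not equal to anything else.
     */
    public static boolean contentEquals(CharSequence cs1, CharSequence cs2) {
        if (cs1 == cs2) {
            return true; // same instance, or both null
        }
        if (cs1 == null || cs2 == null) {
            return false;
        }
        final int length = cs1.length();
        if (length != cs2.length()) {
            return false;
        }
        // Lengths are known to match, so a plain stepwise loop is enough.
        for (int i = 0; i < length; i++) {
            if (cs1.charAt(i) != cs2.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(contentEquals("abc", new StringBuilder("abc"))); // true
        System.out.println(contentEquals("abc", new StringBuilder("abd"))); // false
        System.out.println(contentEquals("abc", "abcd"));                   // false
        System.out.println(contentEquals(null, null));                      // true
    }
}
```

The length check before the loop is what makes the unguarded charAt calls safe.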


> Doesn't seem like a bad idea, though I'm curious whether there's a use-case
> where toString() on both and comparing isn't more expedient.

Just the memory overhead of duplicating to create a String. If a match is unlikely, especially
near the start, then this is a cost to consider for longer strings.
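
As a rough illustration of that cost, the sketch below wraps each input in a hypothetical CountingSequence that counts charAt calls. With a mismatch at index 0 of two 10000-character inputs, the stepwise comparison reads one character from each side, whereas toString() on a StringBuilder would first copy the whole content. All names here are made up for the example.

```java
public final class EarlyExitDemo {

    /** Hypothetical wrapper that counts how many characters are read via charAt. */
    public static final class CountingSequence implements CharSequence {
        private final CharSequence delegate;
        private int reads;

        public CountingSequence(CharSequence delegate) {
            this.delegate = delegate;
        }

        public int getReads() {
            return reads;
        }

        @Override
        public int length() {
            return delegate.length();
        }

        @Override
        public char charAt(int index) {
            reads++;
            return delegate.charAt(index);
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return delegate.subSequence(start, end);
        }

        @Override
        public String toString() {
            return delegate.toString();
        }
    }

    /** Stepwise comparison; assumes both arguments are non-null. */
    public static boolean stepwiseEquals(CharSequence a, CharSequence b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return false; // stop at the first mismatch
            }
        }
        return true;
    }

    /** Builds a sequence of the given length: one leading char, then filler. */
    public static StringBuilder build(char first, int length) {
        StringBuilder sb = new StringBuilder(length);
        sb.append(first);
        for (int i = 1; i < length; i++) {
            sb.append('x');
        }
        return sb;
    }

    public static void main(String[] args) {
        CountingSequence left = new CountingSequence(build('a', 10000));
        CountingSequence right = new CountingSequence(build('b', 10000));
        System.out.println(stepwiseEquals(left, right)); // false
        System.out.println(left.getReads());             // 1
        System.out.println(right.getReads());            // 1
    }
}
```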

I was just after something to put in place of the incorrect usage of:

CharSequence cs1, cs2;
cs1.equals(cs2);

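The CharSequence interface does not define a contract for equals, so this works only if both
objects are of a type that implements content equality, like two Strings. It fails for a String
and a StringBuilder, and StringBuilder itself does not override Object.equals.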
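
A quick check of how equals behaves for the common JDK types, as a sketch. The class and method names are hypothetical; String.contentEquals(CharSequence) is the existing JDK method and is shown only for contrast when one side is known to be a String.

```java
public final class EqualsPitfallDemo {

    /** The pattern in question: relies on whatever equals the runtime type provides. */
    public static boolean naiveEquals(CharSequence cs1, CharSequence cs2) {
        return cs1.equals(cs2);
    }

    public static void main(String[] args) {
        CharSequence s1 = "abc";
        CharSequence s2 = new String("abc");
        CharSequence sb1 = new StringBuilder("abc");
        CharSequence sb2 = new StringBuilder("abc");

        // String vs String: content comparison.
        System.out.println(naiveEquals(s1, s2));   // true
        // String vs StringBuilder: String.equals requires a String argument.
        System.out.println(naiveEquals(s1, sb1));  // false
        // StringBuilder vs StringBuilder: identity comparison inherited from Object.
        System.out.println(naiveEquals(sb1, sb2)); // false
        // JDK method that compares a String against any CharSequence by content.
        System.out.println("abc".contentEquals(sb1)); // true
    }
}
```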

I’ve just had a look for .equals() in all of [text] and this is actually a bug that is in
the newly submitted JaroWinklerSimilarity too.

I’ll do a PR to fix that one.

> 
> On Sat, Mar 2, 2019 at 11:53 AM Alex Herbert <alex.d.herbert@gmail.com>
> wrote:
> 
>> I am helping with the PR for TEXT-126 to add to the similarity package.
>> 
>> Part of the new algorithm requires identifying if two CharSequences are
>> identical. Is there a utility in Text to do something like this:
>> 
>> public static boolean CharSequenceUtils.equals(CharSequence, CharSequence);
>> 
>> I cannot find one with a quick regex search of the library. I am not
>> familiar with Lang either but this is a dependency so a method from there
>> could be used.
>> 
>> The current PR is using left.equals(right) on the input CharSequence to
>> compare one to another, which is wrong if the two input CharSequences do
>> not support matching, e.g. if the input was a String and StringBuilder then
>> String.equals(StringBuilder) would not match, even if the characters were
>> the same.
>> 
>> Regards,
>> 
>> Alex
>> 
>> 

