commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bruno P. Kinoshita" <ki...@apache.org>
Subject Re: [Text, Lang] Matching two CharSequence instances
Date Sun, 03 Mar 2019 02:00:56 GMT
 Hi Alex,
Also found the two implementations similar, but confusing.
- The parameter name inconsistency
+1


- The edge case logic inconsistency

+1

- Change to a stepwise charAt comparison
+1 if the behaviour is kept. Right now some crazy characters like those old latin letters
are supported (e.g. LATIN CAPITAL LIGATURE IJ), and the result for equal and equal ignore
case are as expected.
And right now those methods do not work with surrogate characters (e.g. GEORGIAN LETTER AN
and GEORGIAN CAPITAL LETTER AN are false for equals and equals ignore case). But this is an
edge case that we can ignore for now I believe.
CheersBruno

    On Sunday, 3 March 2019, 1:29:12 pm NZDT, Alex Herbert <alex.d.herbert@gmail.com>
wrote:  
 
 Having looked a bit more at StringUtils it appears that:

public static boolean equals(final CharSequence cs1, final CharSequence cs2);
public static boolean equalsIgnoreCase(final CharSequence str1, final CharSequence str2);

share edge case logic checking but it is implemented differently (although with the same effect).
They also have inconsistent parameter names (cs1 vs str1).

The equals method then ends up calling CharSequenceUtils.regionMatches but without the ignore
case option. This method does a lot more work than necessary as it performs a lot of edge
case checks then loops with regard to checking for case. It would be cleaner to mimic String.equals
by using a stepwise charAt comparison.

Any objections to a PR to fix:

- The parameter name inconsistency
- The edge case logic inconsistency
- Change to a stepwise charAt comparison

Alex

> On 2 Mar 2019, at 20:49, Alex Herbert <alex.d.herbert@gmail.com> wrote:
> 
> 
>> On 2 Mar 2019, at 16:59, Mark Dacek <mark@syberion.com <mailto:mark@syberion.com>>
wrote:
>> 
>> Is your proposed method a stepwise charAt comparison across both, assuming
>> non-null and equal length?
> 
> Yes. Although the StringUtils.equals(CharSequence, CharSequence) from [lang] will do
the job correctly (thanks Gary). It currently does all the edge case checks then calls a region
matching method using the entire length but the effect is the same as:
> 
> for (int i = 0; i < cs1.length(); i++) {
>    if (cs1.charAt(i) != cs2.charAt(i)) {
>        return false;
>    }
> }
> return true;
> 
> Switching in the above code instead of the call to regionMatches(…) at the end of StringUtils.equals(CharSequence,
CharSequence) would avoid repeating all the edge case checks of length at the start of that
method and the case insensitivity functionality. 
> 
> The StringUtils.equals method already detects if String is input as both arguments and
defaults to that if possible. So this is basically for any other combination of CharSequence
types where a simple stepwise charAt comparison is wanted.
> 
> 
>> Doesn't seem like a bad idea, though I'm curious whether there's a use-case
>> where toString() on both and comparing isn't more expedient.
> 
> Just the memory overhead of duplicating to create a String. If a match is unlikely, especially
near the start, then this is a cost to consider for longer strings.
> 
> I was just after something to put in place of the incorrect usage of:
> 
> CharSequence cs1, cs2;
> cs1.equals(cs2);
> 
> Which is not part of the CharSequence interface and works only if inputting 2 objects
that support equals correctly, like String or StringBuilder.
> 
> I’ve just has a look for .equals() in all of [text] and this is actually a bug that
is in the newly submitted JaroWinklerSimilarity too. 
> 
> I’ll do a PR to fix that one.
> 
>> 
>> On Sat, Mar 2, 2019 at 11:53 AM Alex Herbert <alex.d.herbert@gmail.com <mailto:alex.d.herbert@gmail.com>>
>> wrote:
>> 
>>> I am helping with the PR for TEXT-126 to add to the similarity package.
>>> 
>>> Part of the new algorithm requires identifying if two CharSequences are
>>> identical. Is there a utility in Text to do something like this:
>>> 
>>> public static boolean CharSequenceUtils.equals(CharSequence, CharSequence);
>>> 
>>> I cannot find one with a quick regex search of the library. I am not
>>> familiar with Lang either but this is a dependency so a method from there
>>> could be used.
>>> 
>>> The current PR is using left.equals(right) on the input CharSequence to
>>> compare to one to another which is wrong if the two input CharSequences do
>>> not support matching, e.g. if the input was a String and StringBuilder then
>>> String.equals(StringBuilder) would not match, even if the characters were
>>> the same.
>>> 
>>> Regards,
>>> 
>>> Alex
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org <mailto:dev-unsubscribe@commons.apache.org>
>>> For additional commands, e-mail: dev-help@commons.apache.org <mailto:dev-help@commons.apache.org>
>>> 
>>> 
> 
  
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message