commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From kinow <>
Subject [GitHub] commons-text pull request #85: Add optimization to limited levenshtein dista...
Date Sat, 21 Jul 2018 11:27:41 GMT
Github user kinow commented on a diff in the pull request:
    --- Diff: src/main/java/org/apache/commons/text/similarity/ ---
    @@ -270,12 +270,6 @@ private static int limitedCompare(CharSequence left, CharSequence
right, final i
                 final int max = j > Integer.MAX_VALUE - threshold ? n : Math.min(
                         n, j + threshold);
    -            // the stripe may lead off of the table if s and t are of different
    -            // sizes
    -            if (min > max) {
    --- End diff --
    A good catch I think @luciano-medallia . @chtompki here's what I understood from reading
the code and debugging a bit.
    If you have the strings s1="papa" and s2="papakura", and you use the threshold 3, the
algorithm won't initiate, and we will fail earlier.
    That's because the we are saying that any edit distance under the threshold (3) must be
ignored, and we would now be checking that the difference between the length of both (4) is
greater than the threshold.
    So as the difference between both is 4, and it is greater than our threshold of 3, there's
no need to enter the main algorithm.
    If we increase the threshold to 4, then we get the distance of 4 as it's ok according
to the threshold in our code.
    The part of the code removed I assumed was a constraint removed by this new check. Also
debugged with different values and couldn't find a combination to trigger that if. So I think
the change is OK.
    All tests passed locally too.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message