commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Work logged] (TEXT-126) Dice's Coefficient Algorithm in String similarity
Date Wed, 06 Mar 2019 22:59:00 GMT

     [ https://issues.apache.org/jira/browse/TEXT-126?focusedWorklogId=209220&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-209220
]

ASF GitHub Bot logged work on TEXT-126:
---------------------------------------

                Author: ASF GitHub Bot
            Created on: 06/Mar/19 22:58
            Start Date: 06/Mar/19 22:58
    Worklog Time Spent: 10m 
      Work Description: aherbert commented on issue #103: TEXT-126: Adding Sorensen-Dice similarity
algoritham
URL: https://github.com/apache/commons-text/pull/103#issuecomment-470311464
 
 
   Hi @ameyjadiye,
   
   The code is now a good working implementation. However it should be noted that the Sorensen-Dice
similarity is a binary scoring metric. It can be applied to anything where you have two sets
to be matched. It is closely related to the Jaccard score.
   
   This PR brings in a new algorithm to compute matches using pairs of characters (bigrams).
This is a conflict with existing similarity measures in the package implementing `SimilarityScore<R>`:
   
   JaccardSimilarity - uses `Set<Character>` for matching single characters to produce
`Double`
   JaroWinklerSimilarity - uses single character matching to produce an edit distance which
is then converted to a similarity `Double`
   LongestCommonSubsequence - uses single character matching for sub-sequences to produce
`Integer`
   
   The rest of the package is for an extension of `SimilarityScore<R>` which is `EditDistance<R>`:
   
   CosineDistance - uses a `RegexTokenizer` on whitespace to match words to produce `Double`
   HammingDistance - uses single character matching to produce `Integer`
   JaccardDistance - just inverts the JaccardSimilarity to produce `Double`
   JaroWinklerDistance - complementary of Jaro-Winkler similarity to produce `Double`
   LevenshteinDetailedDistance - single character changes to produce `LevenshteinResults`
   LevenshteinDistance - single character changes to produce `Integer`
   LongestCommonSubsequenceDistance - single character based substring to produce `Double`
   
   **So all the other measures use single characters or words** (CosineDistance).
   
   If you want a Sorensen-Dice distance score using single characters then just use the `JaccardDistance`
and map it: `S = 2J / (1 + J)`
   
   This class is effectively a JaccardDistance with `bigrams` but with a mapped output score.
   
   One suggestion would be a change to be something like `BigramJaccardSimilarity` and then
compute the Jaccard using bigrams. Then put new classes in for `SorensenDiceSimilarity` and
`BiSorensenDiceSimilarity`.
   
   However the current JaccardDistance uses Set<Character> which has space and efficiency
advantages over Set<String>. Once you are using strings then there is no reason to have
only bigrams.
   
   My preference would be to add a new class `TokenizerJaccardSimilarity` (or something nicer)
that accepts a `Tokenizer<CharSequence>` in the constructor. This then tokenises the
input into two sets and computes the union of the two sets.
   
   Your `bigram` case can be fulfilled by writing a new `Tokenizer<CharSequence>` that
returns the bigrams. This new class would be flexible for computing the Jaccard for words,
bigrams, trigrams, n-grams, or whatever else you can tokenise a `CharSequence` into.
   
   Then you add a `TokenSorensonDiceSimilarity` that just maps the Jaccard score. You can
see this for example in the `JaccardDistance` class.
   
   @kinow Opinions on a more flexible approach?
   
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 209220)
    Time Spent: 8.5h  (was: 8h 20m)

> Dice's Coefficient Algorithm in String similarity
> -------------------------------------------------
>
>                 Key: TEXT-126
>                 URL: https://issues.apache.org/jira/browse/TEXT-126
>             Project: Commons Text
>          Issue Type: Improvement
>            Reporter: Vicky Chawda
>            Priority: Major
>          Time Spent: 8.5h
>  Remaining Estimate: 0h
>
> I'd like to propose an extension to the algorithms for string similarity in *commons-text/src/main/java/org/apache/commons/text/similarity/*
>  Dice's Coefficient Algorithm can be helpful for many who are looking for ranking similarities
in strings.
> *Inspired from* - [http://www.catalysoft.com/articles/StrikeAMatch.html]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message