commons-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Work logged] (TEXT-155) Add a generic SetSimilarity measure
Date Thu, 07 Mar 2019 21:35:00 GMT

     [ https://issues.apache.org/jira/browse/TEXT-155?focusedWorklogId=209793&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-209793 ]

ASF GitHub Bot logged work on TEXT-155:
---------------------------------------

                Author: ASF GitHub Bot
            Created on: 07/Mar/19 21:34
            Start Date: 07/Mar/19 21:34
    Worklog Time Spent: 10m 
      Work Description: kinow commented on issue #109: TEXT-155: Add a generic IntersectionSimilarity measure
URL: https://github.com/apache/commons-text/pull/109#issuecomment-470703522
 
 
   @aherbert I will have another play with the code later, when I have more time. Another library, Simmetrics, also implements a [helper class/method for the intersection](https://github.com/Simmetrics/simmetrics/blob/59dc148f402da6a8a82ad8604a64fa35d1f70460/simmetrics-core/src/main/java/org/simmetrics/metrics/Math.java). I think the design here looks similar.
   
   However, I think it would make more sense for `IntersectionResult` to be used in the other metrics.
   
   Wouldn't it be possible to use `IntersectionResult` in the Jaccard metric, and even in the new Sorensen-Dice metric?
   
   We can keep `IntersectionSimilarity`, but maybe make it an internal or package-private class? The F1 score and Jaccard would then move to their own classes. In the Jaccard case, I believe that means replacing the code in the existing `JaccardSimilarity` with `IntersectionResult` + `IntersectionSimilarity`, and having `JaccardSimilarity#apply` simply return what `IntersectionResult#getJaccard` computes now.
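
   A minimal sketch of that idea, under stated assumptions: `Counts` is a hypothetical stand-in for `IntersectionResult` (its real API is not shown in this message), `JaccardSketch` is not the existing `JaccardSimilarity`, and treating two empty inputs as identical is a convention chosen only for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a Jaccard metric that delegates to intersection counts.
final class JaccardSketch {

    /** Hypothetical stand-in for the PR's IntersectionResult: set sizes only. */
    static final class Counts {
        final int sizeA;
        final int sizeB;
        final int intersection;

        Counts(final int sizeA, final int sizeB, final int intersection) {
            this.sizeA = sizeA;
            this.sizeB = sizeB;
            this.intersection = intersection;
        }

        /** Intersection size over union size; assumes two empty sets score 1.0. */
        double jaccard() {
            final int union = sizeA + sizeB - intersection;
            return union == 0 ? 1.0 : (double) intersection / union;
        }
    }

    /** Splits the input into its set of distinct characters. */
    static Set<Character> toCharSet(final CharSequence cs) {
        final Set<Character> set = new HashSet<>();
        for (int i = 0; i < cs.length(); i++) {
            set.add(cs.charAt(i));
        }
        return set;
    }

    /** Plays the role of the internal intersection helper. */
    static Counts intersect(final CharSequence left, final CharSequence right) {
        final Set<Character> a = toCharSet(left);
        final Set<Character> b = toCharSet(right);
        final Set<Character> common = new HashSet<>(a);
        common.retainAll(b);
        return new Counts(a.size(), b.size(), common.size());
    }

    /** The public metric only reads the score off the counts. */
    static double jaccard(final CharSequence left, final CharSequence right) {
        return intersect(left, right).jaccard();
    }
}
```

   With this shape, `JaccardSketch.jaccard("abc", "bcd")` shares two of four distinct characters and gives 0.5; a Sorensen-Dice class could reuse the same counts with a different formula.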
   
   What do you think?
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 209793)
    Time Spent: 0.5h  (was: 20m)

> Add a generic SetSimilarity measure
> -----------------------------------
>
>                 Key: TEXT-155
>                 URL: https://issues.apache.org/jira/browse/TEXT-155
>             Project: Commons Text
>          Issue Type: New Feature
>    Affects Versions: 1.6
>            Reporter: Alex D Herbert
>            Priority: Minor
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The {{SimilarityScore<T>}} interface can be used to compute a generic result. I propose to add a class that computes the intersection between two sets formed from the characters of the inputs. The sets are formed from the {{CharSequence}} inputs to the {{apply}} method, using a {{Function<CharSequence, Set<T>>}} to convert each {{CharSequence}}. This function can be passed to the {{SimilarityScore<T>}} implementation during construction.
> The result then reports the size of each set and the size of the intersection.
> I have created an implementation that can compute the equivalent of the {{JaccardSimilarity}} class by creating {{Set<Character>}}, and also the F1-score using bigrams (pairs of characters) by creating {{Set<String>}}. This relates to [TEXT-126|https://issues.apache.org/jira/projects/TEXT/issues/TEXT-126], which suggested an algorithm for the Sorensen-Dice similarity, also known as the F1-score.
> Here is an example:
> {code:java}
> import java.util.HashSet;
> import java.util.Set;
> import java.util.function.Function;
>
> // Match the functionality of the JaccardSimilarity class
> Function<CharSequence, Set<Character>> converter = (cs) -> {
>     final Set<Character> set = new HashSet<>();
>     for (int i = 0; i < cs.length(); i++) {
>         set.add(cs.charAt(i));
>     }
>     return set;
> };
> IntersectionSimilarity<Character> similarity = new IntersectionSimilarity<>(converter);
> IntersectionResult result = similarity.apply("something", "something else");
> {code}
> The result has the size of set A, set B and the intersection between them.
> This class was inspired by my look through the various similarity implementations. All
of them except the {{CosineSimilarity}} perform single character matching between the input
{{CharSequence}}s. The {{CosineSimilarity}} tokenises using whitespace to create words.
> This more generic type of implementation will allow a user to determine how to divide the {{CharSequence}} to create the sets that are compared, e.g. single characters, words, bigrams, etc.
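
For illustration, the bigram case mentioned above could be sketched as follows. This is only a sketch under assumptions: `BigramDiceSketch`, `BIGRAMS` and `dice` are hypothetical names rather than part of the proposed classes, the score is the usual Sorensen-Dice formula 2 * |A intersect B| / (|A| + |B|), and scoring two empty sets as 1.0 is an arbitrary convention.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: Sorensen-Dice (F1) score over sets built by a converter.
final class BigramDiceSketch {

    /** Converter that splits the input into its distinct pairs of adjacent characters. */
    static final Function<CharSequence, Set<String>> BIGRAMS = cs -> {
        final Set<String> set = new HashSet<>();
        for (int i = 0; i + 1 < cs.length(); i++) {
            set.add(cs.subSequence(i, i + 2).toString());
        }
        return set;
    };

    /** Twice the intersection size over the summed set sizes; assumes 1.0 when both sets are empty. */
    static <T> double dice(final CharSequence left, final CharSequence right,
                           final Function<CharSequence, Set<T>> converter) {
        final Set<T> a = converter.apply(left);
        final Set<T> b = converter.apply(right);
        final Set<T> common = new HashSet<>(a);
        common.retainAll(b);
        final int total = a.size() + b.size();
        return total == 0 ? 1.0 : 2.0 * common.size() / total;
    }
}
```

For example, "night" and "nacht" each produce four bigrams and share only "ht", so `BigramDiceSketch.dice("night", "nacht", BigramDiceSketch.BIGRAMS)` gives 2 * 1 / 8 = 0.25. Passing a different converter (single characters, whitespace-separated words) changes what is compared without touching the scoring code.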



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
