From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Work logged] (TEXT-126) Dice's Coefficient Algorithm in String similarity
Date Sat, 09 Mar 2019 06:49:00 GMT
[ https://issues.apache.org/jira/browse/TEXT-126?focusedWorklogId=210489&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-210489 ]
ASF GitHub Bot logged work on TEXT-126:
Author: ASF GitHub Bot
Created on: 09/Mar/19 06:48
Start Date: 09/Mar/19 06:48
Worklog Time Spent: 10m
similarity algoritham
URL: https://github.com/apache/commons-text/pull/103#issuecomment-471152055

@kinow @aherbert
I wrote the code using the Wikipedia definition as the standard and looked at some libraries
in other languages just for reference. All of them use bigrams to calculate similarity.
Personally, I think that if we go further and use trigrams, 4-grams ... n-grams, the resulting
percentage would be incorrect. We also can't match character by character (i.e. unigrams), as that
would make the result bad too: unigrams ignore character order, but ```ab != ba```.
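
To make the unigram concern concrete, here is a minimal sketch of Dice's coefficient over character n-gram sets. It is not the code from the PR: the class `NGramDiceSketch`, its methods, and the choice to score two strings without any n-grams as 0 are assumptions made only for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class, not part of commons-text.
final class NGramDiceSketch {

    // Collects the distinct character n-grams of s (empty if s is shorter than n).
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Dice's coefficient over n-gram sets: 2 * |intersection| / (|X| + |Y|).
    static double dice(String x, String y, int n) {
        Set<String> gx = ngrams(x, n);
        Set<String> gy = ngrams(y, n);
        if (gx.isEmpty() && gy.isEmpty()) {
            return 0.0; // assumption: no n-grams on either side scores 0
        }
        Set<String> common = new HashSet<>(gx);
        common.retainAll(gy);
        return 2.0 * common.size() / (gx.size() + gy.size());
    }

    public static void main(String[] args) {
        System.out.println(dice("ab", "ba", 1)); // unigrams {a, b} vs {b, a}: prints 1.0
        System.out.println(dice("ab", "ba", 2)); // bigrams {ab} vs {ba}: prints 0.0
    }
}
```

With unigrams, "ab" and "ba" come out as identical, while bigrams keep the two apart.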

With n-grams, are we tampering with the existing, proven algorithm? If it gives better results
than the existing algorithm, I'm OK with that. Also, does anyone really need it in real-world
use?

Wikipedia says

> When taken as a string similarity measure, the coefficient may be calculated for two
strings, x and y using bigrams as follows:[9]
>
>  s = (2 · nt) / (nx + ny)
> where nt is the number of character bigrams found in both strings, nx is the number
of bigrams in string x and ny is the number of bigrams in string y. For example, to calculate
the similarity between:
>
> night
> nacht
> We would find the set of bigrams in each word:
>
> {ni,ig,gh,ht}
> {na,ac,ch,ht}
> Each set has four elements, and the intersection of these two sets has only one element:
ht.
>
> Inserting these numbers into the formula, we calculate, s = (2 · 1) / (4 + 4) = 0.25.
>
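
The worked example quoted above can be checked with a few lines of Java. This is only a sketch under stated assumptions: `BigramDiceExample`, `bigrams` and `similarity` are hypothetical names, not the API proposed in the PR, and it assumes both inputs have at least two characters.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration class, not part of commons-text.
final class BigramDiceExample {

    // Distinct character bigrams of s, kept in order of first appearance.
    static Set<String> bigrams(String s) {
        Set<String> result = new LinkedHashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            result.add(s.substring(i, i + 2));
        }
        return result;
    }

    // s = (2 * nt) / (nx + ny); assumes each input has at least one bigram.
    static double similarity(String x, String y) {
        Set<String> bx = bigrams(x);
        Set<String> by = bigrams(y);
        Set<String> shared = new LinkedHashSet<>(bx);
        shared.retainAll(by); // nt: bigrams found in both strings
        return 2.0 * shared.size() / (bx.size() + by.size());
    }

    public static void main(String[] args) {
        System.out.println(bigrams("night"));             // [ni, ig, gh, ht]
        System.out.println(bigrams("nacht"));             // [na, ac, ch, ht]
        System.out.println(similarity("night", "nacht")); // 0.25
    }
}
```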

Not sure, but are we over-engineering the similarities with #109? Let me know if there is a
practical use of n-grams in the real world; I would like to study it more.

Issue Time Tracking
Worklog Id:     (was: 210489)
Time Spent: 10h 20m  (was: 10h 10m)

> Dice's Coefficient Algorithm in String similarity
>
>                 Key: TEXT-126
>                 URL: https://issues.apache.org/jira/browse/TEXT-126
>             Project: Commons Text
>          Issue Type: Improvement
>            Reporter: Vicky Chawda
>            Priority: Major
>          Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> I'd like to propose an extension to the algorithms for string similarity in *commons-text/src/main/java/org/apache/commons/text/similarity/*
>  Dice's Coefficient Algorithm can be helpful for many who are looking for ranking similarities
in strings.
> *Inspired from* - [http://www.catalysoft.com/articles/StrikeAMatch.html]


