commons-dev mailing list archives

From Rob Tompkins <>
Subject Re: [TEXT-2] Add Jaccard Index and Jaccard Distance
Date Thu, 17 Nov 2016 14:15:10 GMT
Hello Don,

Just as an FYI, I recently added an interface called "SimilarityScore" (mainly for the JaroWinkler
distance), which is a weakening of the full mathematical definition of a metric. However, since the
Jaccard distance satisfies the triangle inequality and all of the other metric axioms, it should
implement EditDistance, which is intended to represent string comparisons that fully satisfy the
definition of a metric.


> On Nov 16, 2016, at 11:06 AM, don jeba <> wrote:
> Hello,
>
> I am planning to work on ticket TEXT-2. I need your guidance on naming and placing
> the class files for implementing it.
>
> The ticket asks for the Jaccard Index [measures similarity] and the Jaccard Distance
> [measures dissimilarity].
>
> Below is what I am planning to do.
>
> Option 1: Add a new class JaccardBase under package org.apache.commons.text that holds the
> logic to calculate both the index and the distance. As you know, the Jaccard distance is
> 1 - Jaccard index, so there is no separate logic for each of them, and I plan to keep the
> calculation logic in a common place.
>
> Add a new class JaccardIndex under package org.apache.commons.text.similarity. This class
> will be derived from JaccardBase and will expose a public function to get the Jaccard index.
>
> Similarly, add a new class JaccardDistance under package org.apache.commons.text.diff.
> This class will be derived from JaccardBase and will expose a public function to get the
> Jaccard distance.
>
> The advantage is that there is no code duplication. The disadvantage is that if the caller
> wants both the index and the distance, he/she needs to call 2 separate functions (one from
> the JaccardIndex class and one from the JaccardDistance class), and we need to do the
> calculation twice for the same set of input.
>
> Option 2: Have a single class that returns both the index and the distance. With this
> option, I have 2 questions:
> 1. Where should the new class be kept (under which package)?
> 2. What should the name of the new class be?
> This option fixes the disadvantage of option 1.
>
> I personally prefer option 1, as it looks cleaner considering the way the classes are
> arranged in the package.
>
> Could you kindly review and share your thoughts?
> Do let me know if I am not clear.
>
> Thank you,
> Regards,
> Don Jeba.
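
For reference, the relationship discussed in the quoted message (distance = 1 - index, with the calculation kept in one place) can be sketched roughly as below. This is only an illustration, not the TEXT-2 implementation: the class name JaccardSketch and its method names are hypothetical, the inputs are assumed to be compared as sets of characters, and treating two empty inputs as identical (index 1.0) is an assumption as well.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not part of Apache Commons Text.
final class JaccardSketch {

    private JaccardSketch() {
    }

    /** Jaccard index: size of the intersection divided by size of the union. */
    public static double jaccardIndex(CharSequence left, CharSequence right) {
        if (left == null || right == null) {
            throw new IllegalArgumentException("Input cannot be null");
        }
        Set<Character> a = toCharSet(left);
        Set<Character> b = toCharSet(right);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumption: two empty inputs count as identical
        }
        Set<Character> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        // |A union B| = |A| + |B| - |A intersect B|
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    /** Jaccard distance, derived from the index so the set logic lives in one place. */
    public static double jaccardDistance(CharSequence left, CharSequence right) {
        return 1.0 - jaccardIndex(left, right);
    }

    /** Collects the distinct characters of the input into a set. */
    private static Set<Character> toCharSet(CharSequence s) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < s.length(); i++) {
            set.add(s.charAt(i));
        }
        return set;
    }

    public static void main(String[] args) {
        // "night" and "nacht" share n, h, t out of 7 distinct characters: index = 3/7.
        System.out.println(jaccardIndex("night", "nacht"));
        System.out.println(jaccardDistance("night", "nacht"));
    }
}
```

A caller that needs both values can compute the index once and derive the distance as 1 - index, which avoids running the set calculation twice for the same input.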
