commons-dev mailing list archives

From "Bruno P. Kinoshita" <brunodepau...@yahoo.com.br.INVALID>
Subject Re: [Text] JaccardSimilarity
Date Fri, 08 Mar 2019 00:01:54 GMT
> I’d favour dropping the round and adding it to the Changes.xml via a Jira ticket so
> it is noted if someone upgrades. They can always restore functionality to as-it-was by doing
> a round on the output of the class.

+1

> I’ve already made the test using the python distance.jaccard function from the distance
> library in the PR for Text-155. So changing the test is simple. It’s just the decision on
> whether to do it.

I think we can aim at implementing this for 1.7 (which from the looks of it will have several
bug fixes & improvements!).

Cheers
Bruno


On Friday, 8 March 2019, 10:54:32 am NZDT, Alex Herbert <alex.d.herbert@gmail.com> wrote:

 Hi Bruno,

> On 7 Mar 2019, at 21:18, Bruno P. Kinoshita <kinow@apache.org> wrote:
> 
> Hi Alex,
> Can't recall why it was done that way. When the initial code for the edit distances was
> created, some Java libraries like Simmetrics, java-string-similarity, and Lucene, as well as
> R/Python code, were used to verify the output of the edit distances.
> Maybe we used Math.round just to get a test passing, which I agree should have been documented.
> But even better if we just drop the Math.round and instead update the tests to use the
> assertEquals(expected, actual, threshold) method, with a good enough threshold.
> What do you think?

I’d favour dropping the round and adding it to the Changes.xml via a Jira ticket so it is
noted if someone upgrades. They can always restore functionality to as-it-was by doing a round
on the output of the class. 
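
As a minimal sketch of what "doing a round on the output" could look like for a caller: the class and method names below are hypothetical, and the multiply-by-100 rounding is an assumption about how two-decimal rounding is typically done with Math.round, not a copy of the library code.

```java
public final class RoundedJaccard {

    /** Rounds a similarity value to two decimal places (hypothetical helper). */
    public static double roundToTwoPlaces(double value) {
        // Scale to hundredths, round to the nearest long, then scale back.
        return Math.round(value * 100d) / 100d;
    }

    public static void main(String[] args) {
        // Stand-in for an unrounded similarity returned by the class.
        double unrounded = 1.0 / 3.0;
        System.out.println(unrounded);                   // full precision
        System.out.println(roundToTwoPlaces(unrounded)); // 0.33
    }
}
```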

If I understand the metric correctly (intersection over union), a difference in the 3rd
decimal place would require the union of the two character sets to be above 200, i.e. a string
containing over 200 unique characters, e.g.

A) 0/200 = 0
B) 1/200 = 0.005
C) 2/200 = 0.01

In this case results A and C can be distinguished, but B and C cannot, because B rounds up to 0.01.

So in practical terms it would not make a difference unless using a large character set. For
ASCII strings there is no difference.
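
The A/B/C cases above can be reproduced with a small sketch. This is not the Commons Text implementation: the class, the helper names, the choice of code points starting at 0x100, and returning 0 for two empty inputs are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

public final class JaccardRoundingSketch {

    /** Unrounded Jaccard similarity over the sets of chars in each input. */
    public static double similarity(CharSequence left, CharSequence right) {
        Set<Character> a = toSet(left);
        Set<Character> b = toSet(right);
        Set<Character> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0; // assumption: two empty inputs score 0
        }
        Set<Character> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return (double) intersection.size() / union.size();
    }

    /** Two-decimal rounding, assumed to mirror the Math.round under discussion. */
    public static double roundToTwoPlaces(double value) {
        return Math.round(value * 100d) / 100d;
    }

    /** Builds a string of {@code count} distinct chars starting at {@code start}. */
    public static String distinctChars(int start, int count) {
        StringBuilder sb = new StringBuilder(count);
        for (int i = 0; i < count; i++) {
            sb.append((char) (start + i));
        }
        return sb.toString();
    }

    private static Set<Character> toSet(CharSequence cs) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < cs.length(); i++) {
            set.add(cs.charAt(i));
        }
        return set;
    }

    public static void main(String[] args) {
        String left = distinctChars(0x100, 200);                 // 200 unique chars
        double b = similarity(left, distinctChars(0x100, 1));    // 1/200
        double c = similarity(left, distinctChars(0x100, 2));    // 2/200
        System.out.println(b + " vs " + c);                      // distinct unrounded
        System.out.println(roundToTwoPlaces(b) + " vs " + roundToTwoPlaces(c)); // equal once rounded
    }
}
```

With the rounding removed, B and C stay distinct; with it, both collapse to 0.01.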

I’ve already made the test using the python distance.jaccard function from the distance
library in the PR for Text-155. So changing the test is simple. It’s just the decision on
whether to do it.

Alex


> Cheers
> Bruno
> 
> On Friday, 8 March 2019, 4:49:52 am NZDT, Alex Herbert <alex.d.herbert@gmail.com> wrote:
> 
> A quick question about the JaccardSimilarity class:
> 
> Q. Why does it round the similarity to 2 decimal places?
> 
> This is not documented.
> 
> It is also done in the complementary JaccardDistance class.
> 
> Looking at the history in git it seems to have always been that way. 
> First commit was 2016-11-27.
> 
> Thanks,
> 
> Alex
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
> 
