lucene-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8011) Improve similarity explanations
Date Fri, 01 Dec 2017 08:38:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16274115#comment-16274115 ]

ASF GitHub Bot commented on LUCENE-8011:
----------------------------------------

Github user jpountz commented on the issue:

    https://github.com/apache/lucene-solr/pull/280
  
    Thanks @mayya-sharipova, this looks like great progress to me. Maybe we could go even
    further and do the following:
     - in the Axiomatic similarity, add abstract methods so that subclasses can explain how
       tf, ln, etc. are computed,
     - make BasicModel.explain abstract to force subclasses to provide their own explanation
       and include the formula,
     - make sure that our own subclasses of SimilarityBase override explain (the one that
       returns an explanation) and include the formula in the explanation.
    
    For the record, backward compatibility is not much of a concern here, since most of
    those classes (e.g. Axiomatic, BasicModel) are very expert classes and this change
    targets master.
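
The pattern suggested in the list above (an abstract explain that every subclass must implement, with the formula in the description) could be sketched roughly as follows. This is only an illustration under stated assumptions: `Node`, `ModelSketch` and `LogModel` are hypothetical stand-ins, not Lucene classes, and `log(1 + tfn)` is a toy formula chosen for the example rather than one of the real models.

```java
import java.util.Arrays;
import java.util.List;

public class ExplainSketch {

    /** Hypothetical stand-in for an explanation node: a value, a description and sub-details. */
    static final class Node {
        final double value;
        final String description;
        final List<Node> details;

        Node(double value, String description, Node... details) {
            this.value = value;
            this.description = description;
            this.details = Arrays.asList(details);
        }

        /** Renders the tree with two spaces of indentation per level. */
        String render() {
            StringBuilder sb = new StringBuilder();
            render(sb, 0);
            return sb.toString();
        }

        private void render(StringBuilder sb, int depth) {
            for (int i = 0; i < depth; i++) {
                sb.append("  ");
            }
            sb.append(value).append(" = ").append(description).append('\n');
            for (Node detail : details) {
                detail.render(sb, depth + 1);
            }
        }
    }

    /** Hypothetical base class: explain is abstract, so no subclass can skip it. */
    abstract static class ModelSketch {
        abstract double score(double tfn);

        abstract Node explain(double tfn);
    }

    /** Toy subclass whose explanation spells out its own formula and inputs. */
    static final class LogModel extends ModelSketch {
        @Override
        double score(double tfn) {
            return Math.log(1 + tfn);
        }

        @Override
        Node explain(double tfn) {
            return new Node(score(tfn), "score, computed as log(1 + tfn) from:",
                    new Node(tfn, "tfn, normalized term frequency"));
        }
    }

    public static void main(String[] args) {
        System.out.print(new LogModel().explain(2.0).render());
    }
}
```

Because `explain` has no default body in the base class, a subclass that forgets to describe its formula fails to compile instead of silently inheriting a generic explanation.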


> Improve similarity explanations
> -------------------------------
>
>                 Key: LUCENE-8011
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8011
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>              Labels: newdev
>
> LUCENE-7997 improves the BM25 and Classic explanations so that they are easier to read:
> {noformat}
> product of:
>   2.2 = scaling factor, k1 + 1
>   9.388654 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
>     1.0 = n, number of documents containing term
>     17927.0 = N, total number of documents with field
>   0.9987758 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
>     979.0 = freq, occurrences of term within document
>     1.2 = k1, term saturation parameter
>     0.75 = b, length normalization parameter
>     1.0 = dl, length of field
>     1.0 = avgdl, average length of field
> {noformat}
> Previously it was pretty cryptic and used confusing terminology like docCount/docFreq without explanation:
> {noformat}
> product of:
>   0.016547536 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
>     449.0 = docFreq
>     456.0 = docCount
>   2.1920826 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
>     113659.0 = freq=113658
>     1.2 = parameter k1
>     0.75 = parameter b
>     2300.5593 = avgFieldLength
>     1048600.0 = fieldLength
> {noformat}
> We should fix the other similarities in the same way; their explanations should be more practical.
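
As a sanity check on the improved explanation quoted above, the formulas it prints can be recomputed from the listed inputs. The sketch below is a minimal one under a few assumptions: the class and method names are made up for illustration, `log` is taken to be the natural logarithm, and the arithmetic is done in double precision, so the last digits may differ slightly from the float values in the explanation.

```java
import java.util.Locale;

public class Bm25ExplainSketch {

    /** idf = log(1 + (N - n + 0.5) / (n + 0.5)), as printed in the explanation. */
    static double idf(double n, double bigN) {
        return Math.log(1 + (bigN - n + 0.5) / (n + 0.5));
    }

    /** tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)), as printed in the explanation. */
    static double tf(double freq, double k1, double b, double dl, double avgdl) {
        return freq / (freq + k1 * (1 - b + b * dl / avgdl));
    }

    public static void main(String[] args) {
        double k1 = 1.2;   // term saturation parameter
        double b = 0.75;   // length normalization parameter

        double idfValue = idf(1.0, 17927.0);            // n = 1, N = 17927
        double tfValue = tf(979.0, k1, b, 1.0, 1.0);    // freq = 979, dl = avgdl = 1

        // The explanation is a "product of" the scaling factor (k1 + 1), idf and tf.
        double score = (k1 + 1) * idfValue * tfValue;

        System.out.printf(Locale.ROOT, "idf   = %.6f%n", idfValue);
        System.out.printf(Locale.ROOT, "tf    = %.7f%n", tfValue);
        System.out.printf(Locale.ROOT, "score = %.4f%n", score);
    }
}
```

With the inputs from the explanation, the recomputed idf and tf should agree with the printed 9.388654 and 0.9987758 up to rounding, which is exactly the kind of check the more descriptive output makes possible.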



