lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?
Date Thu, 19 Oct 2017 15:21:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211193#comment-16211193 ]

Robert Muir commented on LUCENE-8000:
-------------------------------------

I don't think we should disable discountOverlaps.
The reason is that too many commonly used token filters add synonyms or similar tokens,
and these bias document lengths. I've done measurements here, and that's why I originally
proposed enabling it by default (the option was there, but was disabled by default).

Average document length will never be exact either (due to deleted documents and many other
reasons). The norm is inexact too, since it's a single byte. Ultimately this average is just
a pivot; it doesn't need to be pedantically correct, and we shouldn't make relevance worse
for no good reason.
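
To illustrate the "pivot" point, here is a minimal, self-contained sketch of the textbook BM25 term-frequency normalization in plain Java. It is not Lucene's code: the class and method names are hypothetical, and k1 = 1.2 and b = 0.75 are assumed as the usual defaults. For a fixed term frequency, changing the average length shifts the length penalty for every document at once, while a shorter document still scores higher than a longer one.

```java
// Hypothetical sketch, not Lucene's implementation.
final class Bm25PivotSketch {
    static final double K1 = 1.2;  // assumed default
    static final double B = 0.75;  // assumed default

    // Textbook BM25 tf component: the document length only enters
    // through the ratio docLen / avgLen, so avgLen acts as a pivot.
    static double tfNorm(double freq, double docLen, double avgLen) {
        double lengthFactor = 1 - B + B * (docLen / avgLen);
        return (K1 + 1) * freq / (freq + K1 * lengthFactor);
    }

    public static void main(String[] args) {
        double exactAvg = 100;     // made-up "exact" average length
        double inflatedAvg = 110;  // made-up average that is 10% too high
        for (double docLen : new double[] {80, 100, 120}) {
            System.out.printf("len=%.0f exact=%.4f inflated=%.4f%n",
                docLen, tfNorm(1, docLen, exactAvg), tfNorm(1, docLen, inflatedAvg));
        }
    }
}
```

The lengths and averages in `main` are invented for illustration only; they say nothing about how large the effect is on a real index.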

If you have a different or special use case, you can disable discountOverlaps yourself;
that's why the option is there.

> Document Length Normalization in BM25Similarity correct?
> --------------------------------------------------------
>
>                 Key: LUCENE-8000
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8000
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Christoph Goller
>            Priority: Minor
>
> Length of individual documents only counts the number of positions of a document since
> discountOverlaps defaults to true.
>  {quote} @Override
>   public final long computeNorm(FieldInvertState state) {
>     final int numTerms = discountOverlaps ? state.getLength() - state.getNumOverlap() : state.getLength();
>     int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
>     if (indexCreatedVersionMajor >= 7) {
>       return SmallFloat.intToByte4(numTerms);
>     } else {
>       return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
>     }
>   }{quote}
> Measuring document length this way seems perfectly OK to me. What bothers me is that
> average document length is based on sumTotalTermFreq for a field. As far as I understand,
> that sums up totalTermFreqs for all terms of a field, therefore counting positions of terms
> including those that overlap.
> {quote}  protected float avgFieldLength(CollectionStatistics collectionStats) {
>     final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
>     if (sumTotalTermFreq <= 0) {
>       return 1f;       // field does not exist, or stat is unsupported
>     } else {
>       final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
>       return (float) (sumTotalTermFreq / (double) docCount);
>     }
>   }{quote}
> Are we comparing apples and oranges in the final scoring?
> I haven't run any benchmarks and I am not sure whether this has a serious effect. It
> just means that documents that have synonyms or, in our case, different normal forms of
> tokens on the same position are shorter and therefore get higher scores than they should,
> and that we do not use the whole spectrum of relative document length of BM25.
> I think for BM25 discountOverlaps should default to false.
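
The mismatch described in the quoted report can be worked through with small numbers. The following is a minimal sketch in plain Java, not Lucene's code: the class name, the helper signatures, and the two sample documents are hypothetical. `docLength` follows the quoted computeNorm logic, and `avgFieldLength` follows the quoted averaging, under the report's assumption that sumTotalTermFreq counts every token, overlaps included.

```java
// Hypothetical sketch of the two length measures quoted above.
final class OverlapLengthSketch {
    // Per-document length as in the quoted computeNorm:
    // overlapping tokens are subtracted when discountOverlaps is true.
    static int docLength(int length, int numOverlap, boolean discountOverlaps) {
        return discountOverlaps ? length - numOverlap : length;
    }

    // Average as in the quoted avgFieldLength, assuming sumTotalTermFreq
    // is the sum of all tokens per document, overlaps included.
    static double avgFieldLength(int[] totalTokensPerDoc) {
        long sumTotalTermFreq = 0;
        for (int tokens : totalTokensPerDoc) {
            sumTotalTermFreq += tokens;
        }
        return totalTokensPerDoc.length == 0
            ? 1.0
            : sumTotalTermFreq / (double) totalTokensPerDoc.length;
    }

    public static void main(String[] args) {
        int[] tokens = {100, 120};  // made-up totals; doc 2 has 20 synonym tokens
        int[] overlaps = {0, 20};
        double avg = avgFieldLength(tokens);
        for (int i = 0; i < tokens.length; i++) {
            int len = docLength(tokens[i], overlaps[i], true);
            System.out.println("doc " + i + ": length=" + len + " average=" + avg);
        }
    }
}
```

In this made-up case both documents have a discounted length of 100 while the average is 110, so every document looks slightly shorter than average. Whether that shift matters for ranking is exactly the question under discussion in this thread.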



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

