lucene-dev mailing list archives

From "Christoph Goller (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?
Date Fri, 20 Oct 2017 08:40:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212350#comment-16212350 ]

Christoph Goller edited comment on LUCENE-8000 at 10/20/17 8:39 AM:
--------------------------------------------------------------------

??My point is that defaults are for typical use-cases, and the default of discountOverlaps
meets that goal. It results in better (measured) performance for many tokenfilters that are
commonly used such as common-grams, WDF, synonyms, etc. I ran these tests before proposing
the default, it was not done flying blind.??

Understood. *I have not experienced any problems with the current default* and I have the
option to set discountOverlaps to false. Therefore it's OK with me if the ticket gets closed.

I only think about this out of "scientific" curiosity in the context of relevance tuning.

What benchmarks have you used for measuring performance?

Is your opinion based on tests with Lucene Classic Similarity (it also uses discountOverlaps
= true) or also on tests with BM25?

Do you have any idea / explanation why relevancy is better with discountOverlaps = true? My
naive guess would be that since stopwords or synonyms are either used on all documents or
on none, it should not make much difference whether we count overlaps or not.
Is the explanation that for some documents many stopwords / synonyms / WDF splits are used
and for others not (for the same field)? Another possible explanation would be that some fields
have synonyms and others have not. That would punish fields with synonyms compared to others
since their length is greater (in Classic Similarity with discountOverlaps = false), but in
BM25 it should not have this effect, since BM25 uses relative length for scoring and not absolute
length.
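
This cancellation argument can be sketched with a toy calculation (plain Java, not Lucene code; the numbers are hypothetical and b = 0.75 is BM25's usual default). If synonyms inflate every document's token count by the same factor, the BM25 length component 1 - b + b * dl/avgdl is unchanged, while Classic's 1/sqrt(dl) norm shrinks:

```java
public class LengthNormSketch {
    // BM25 length normalization term: 1 - b + b * dl / avgdl
    static double bm25LengthPart(double dl, double avgdl, double b) {
        return 1 - b + b * dl / avgdl;
    }

    // Classic (TF-IDF style) length norm: 1 / sqrt(dl)
    static double classicLengthNorm(double dl) {
        return 1.0 / Math.sqrt(dl);
    }

    public static void main(String[] args) {
        double b = 0.75;
        // Suppose synonyms double every document's token count uniformly.
        double dl = 100, avgdl = 100;

        // BM25: dl/avgdl is unchanged, so the factor is identical.
        System.out.println(bm25LengthPart(dl, avgdl, b));         // 1.0
        System.out.println(bm25LengthPart(2 * dl, 2 * avgdl, b)); // 1.0

        // Classic: the absolute length doubles, so the norm shrinks.
        System.out.println(classicLengthNorm(dl));      // 0.1
        System.out.println(classicLengthNorm(2 * dl));  // ~0.0707
    }
}
```

So under uniform inflation BM25 should indeed be insensitive to counting overlaps, which is why the per-field / per-document variation seems the more plausible explanation.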

Sorry for bothering you with these questions. It's only my curiosity, and maybe Jira is not
the right place for this.




> Document Length Normalization in BM25Similarity correct?
> --------------------------------------------------------
>
>                 Key: LUCENE-8000
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8000
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Christoph Goller
>            Priority: Minor
>
> The length of an individual document counts only the number of positions in the document, since
discountOverlaps defaults to true.
> {code}
>   @Override
>   public final long computeNorm(FieldInvertState state) {
>     final int numTerms = discountOverlaps ? state.getLength() - state.getNumOverlap() : state.getLength();
>     int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
>     if (indexCreatedVersionMajor >= 7) {
>       return SmallFloat.intToByte4(numTerms);
>     } else {
>       return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
>     }
>   }
> {code}
> Measuring document length this way seems perfectly OK to me. What bothers me is that
> average document length is based on sumTotalTermFreq for a field. As far as I understand,
that sums up totalTermFreqs for all terms of a field, therefore counting positions of terms
including those that overlap.
> {code}
>   protected float avgFieldLength(CollectionStatistics collectionStats) {
>     final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
>     if (sumTotalTermFreq <= 0) {
>       return 1f;       // field does not exist, or stat is unsupported
>     } else {
>       final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
>       return (float) (sumTotalTermFreq / (double) docCount);
>     }
>   }
> {code}
> Are we comparing apples and oranges in the final scoring?
> I haven't run any benchmarks and I am not sure whether this has a serious effect. It
just means that documents that have synonyms, or in my use case different normal forms of tokens
on the same position, are counted as shorter and therefore get higher scores than they should, and that
we do not use the whole spectrum of relative document length of BM25.
> I think for BM25, discountOverlaps should default to false.
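
The asymmetry described in the issue can be sketched numerically (plain Java, hypothetical numbers). Suppose every document in a field has 100 positions plus 20 overlapping tokens (synonyms): computeNorm with discountOverlaps = true stores dl = 100, while the sumTotalTermFreq-based avgFieldLength counts 120, so every document looks shorter than average even though all documents are identical:

```java
public class OverlapMismatchSketch {
    // BM25 length normalization term: 1 - b + b * dl / avgdl
    static double bm25LengthPart(double dl, double avgdl, double b) {
        return 1 - b + b * dl / avgdl;
    }

    public static void main(String[] args) {
        double b = 0.75;
        // Hypothetical collection: every doc has 100 positions and 20 overlaps.
        double positions = 100, overlaps = 20;

        // computeNorm with discountOverlaps = true: dl counts positions only.
        double dl = positions;
        // avgFieldLength from sumTotalTermFreq: overlaps are included.
        double avgdl = positions + overlaps;

        // dl/avgdl = 100/120 < 1, so the length factor is below 1 for all docs.
        System.out.println(bm25LengthPart(dl, avgdl, b));      // 0.875

        // Counting consistently (both sides without overlaps) gives exactly 1.
        System.out.println(bm25LengthPart(dl, positions, b));  // 1.0
    }
}
```

This only compresses the usable range of the length factor rather than reordering identical documents, which may be why the effect is hard to see in benchmarks.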



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
