lucene-dev mailing list archives

From "Chong-Ki Tsang" <>
Subject Lucene's scoring algorithm
Date Fri, 18 Jul 2003 05:26:20 GMT
I am curious to know whether Lucene's scoring algorithm was updated
in the latest 1.3 version.

I found the following scoring formula in the Javadoc for the Similarity
class. It differs from the one shown in the official FAQ. Could you tell
me which one is used in 1.3? If the algorithm was updated, please send
me the current formula. I would appreciate it.





The score of query q for document d is defined in terms of these methods
as follows: 

  score(q,d) = sum over t in q of (
                   tf(t in d) *
                   idf(t) *
                   getBoost(t.field in d) *
                   lengthNorm(t.field in d)
               ) * coord(q,d) * queryNorm(q)
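To make the structure of the Javadoc formula concrete, here is a minimal Python sketch of it. It is only an illustration, not Lucene code: the six factors are passed in as plain functions because the formula only names them, and the toy data and stand-in functions below (doc, doc_freq, NUM_DOCS and the lambdas' behaviour) are hypothetical choices made for this example.

```python
import math

def score(query_terms, doc, tf, idf, get_boost, length_norm, coord, query_norm):
    """Combine the factors in the order the Javadoc formula lists them."""
    per_term_sum = sum(
        tf(t, doc) * idf(t) * get_boost(t, doc) * length_norm(t, doc)
        for t in query_terms
    )
    # coord and queryNorm do not depend on t, so they multiply the whole sum.
    return per_term_sum * coord(query_terms, doc) * query_norm(query_terms)

# Hypothetical toy data: one document with a single field.
doc = {"lucene": 4, "scoring": 1}        # term -> frequency in the field
doc_freq = {"lucene": 9, "scoring": 99}  # term -> number of docs containing it
NUM_DOCS = 100

# Hypothetical stand-ins for the Similarity methods (assumptions, not the real API).
def tf(t, d): return math.sqrt(d.get(t, 0))
def idf(t): return math.log(NUM_DOCS / (doc_freq[t] + 1)) + 1.0
def get_boost(t, d): return 1.0
def length_norm(t, d): return 1.0 / math.sqrt(sum(d.values()))
def coord(q, d): return sum(1 for t in q if t in d) / len(q)
def query_norm(q): return 1.0

result = score(["lucene", "scoring"], doc,
               tf, idf, get_boost, length_norm, coord, query_norm)
```

Swapping any one stand-in changes only that factor, which is the point of defining the score "in terms of these methods".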



In the official FAQ, Lucene's scoring algorithm is given as:


31. How does Lucene assign scores to hits?

Here is a quote from Doug himself (posted in July 2001 to the Lucene
users mailing list):


For the record, Lucene's scoring algorithm is, roughly:


  score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)



  score_d   : score for document d
  sum_t     : sum for all terms t
  tf_q      : the square root of the frequency of t in the query
  tf_d      : the square root of the frequency of t in d
  idf_t     : log(numDocs/(docFreq_t+1)) + 1.0
  numDocs   : number of documents in index
  docFreq_t : number of documents containing t
  norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
  norm_d_t  : square root of number of tokens in d in the same field as t


(I hope that's right!)


[Doug later added...]


Make that:


  score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d

  boost_t    : the user-specified boost for term t
  coord_q_d  : number of terms in both query and document / number of terms in query


The coordination factor gives an AND-like boost to documents that
contain, e.g., all three terms in a three word query over those that
contain just two of the words.
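For comparison, the FAQ formula (with boost_t and coord_q_d included) can be sketched in Python as below. This is a rough illustration under stated assumptions, not Lucene's implementation: it treats the document as a single field, reads idf_t as log(numDocs/(docFreq_t+1)) + 1.0, counts distinct terms for coord_q_d, and the function and parameter names (faq_score, doc_freqs, boosts) are made up for the example.

```python
import math
from collections import Counter

def idf(num_docs, doc_freq):
    # Assumed reading of the FAQ: log(numDocs / (docFreq_t + 1)) + 1.0
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def faq_score(query, doc_tokens, num_docs, doc_freqs, boosts=None):
    """score_d = sum_t(tf_q*idf_t/norm_q * tf_d*idf_t/norm_d_t * boost_t) * coord_q_d"""
    boosts = boosts or {}
    q_counts = Counter(query)
    d_counts = Counter(doc_tokens)
    if not q_counts:
        return 0.0

    # Query-side weights tf_q * idf_t, and their Euclidean norm norm_q.
    q_weights = {t: math.sqrt(n) * idf(num_docs, doc_freqs[t])
                 for t, n in q_counts.items()}
    norm_q = math.sqrt(sum(w * w for w in q_weights.values()))

    # Single-field assumption: norm_d_t is the same for every term.
    norm_d = math.sqrt(len(doc_tokens))

    total, matched = 0.0, 0
    for t, w_q in q_weights.items():
        if d_counts[t] == 0:
            continue  # terms missing from the document contribute nothing
        matched += 1
        tf_d = math.sqrt(d_counts[t])
        total += (w_q / norm_q) * (tf_d * idf(num_docs, doc_freqs[t]) / norm_d) \
                 * boosts.get(t, 1.0)

    coord = matched / len(q_counts)  # coord_q_d over distinct query terms
    return total * coord

# A three-word query against a document with all three terms and one with two.
freqs = {"a": 1, "b": 1, "c": 1}
all_three = faq_score(["a", "b", "c"], ["a", "b", "c", "x"], 10, freqs)
just_two = faq_score(["a", "b", "c"], ["a", "b", "x", "x"], 10, freqs)
```

With equal document lengths and equal idf values, the two-term document loses one term's contribution and is then scaled by coord_q_d = 2/3, which is the AND-like effect described above.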


