lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hui Ouyang" <>
Subject RE: Interpreting the score asociated with the Term? |
Date Fri, 31 Jan 2003 15:58:58 GMT
Is it possible to access each individual frequency in the score formula so I could calculate
and show the score for each of the sub query?

	-----Original Message----- 
	From: Otis Gospodnetic [] 
	Sent: Thu 1/23/2003 10:58 AM 
	To: Lucene Users List; 
	Subject: Re: Interpreting the score asociated with the Term? |

	Here is a simplified explanation of some basic stuff.
	1. the more frequent the term (in a collection) the lower its weight
	(significance).  Makes sense - very popular words don't distinguish one
	document from the other much, because they are present in so many docs.
	2. the more frequent a word in a single document, the higher the
	documents 'value' when the query contains that word.  So the score goes
	up for frequent words in a document, esp. if they are not frequent in
	other documents in the collection.
	3. there is a boost factor which allow you to boost certain terms at
	query time (e.g. you value matches in title field more than the body
	field?  boost title field queries)
	4. normalization factor, I believe, normalizes things so that longer
	documents don't have advantage over shorter ones.
	There is more to this....but I am already not 100% about all of the
	above, so I'll stop here :)
	Also note that you can boost fields at index time (you'll have to use
	the nightly build for that instead of the 1.2 release to get this, I
	--- Rishabh Bajpai <> wrote:
	> Hi All,
	> I am using Lucene as a Search Engine for my work. I am new to this,
	> so forgive me if I am asking a cliched question!
	> I need to understand how the SCORE for the search TERMs is calculated
	> for Lucene, so that indexing can be appropriately be designed to
	> return the most relevant results, when searched.
	> On the official FAQ page of the Lucene site, a formula is listed as
	> score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
	> boost_t) * coord_q_d
	> where:
	>   score_d   : score for document d
	>   sum_t     : sum for all terms t
	>   tf_q      : the square root of the frequency of t in the query
	>   tf_d      : the square root of the frequency of t in d
	>   idf_t     : log(numDocs/docFreq_t+1) + 1.0
	>   numDocs   : number of documents in index
	>   docFreq_t : number of documents containing t
	>   norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
	>   norm_d_t  : square root of number of tokens in d in the same field
	> as t
	>   boost_t   : the user-specified boost for term t
	>   coord_q_d : number of terms in both query and document / number of
	> terms in query
	> I didnot find the formula too helpful in figuring out what exactly
	> the score is trying to calculate.
	> I want to know of a logic that can be used for translating this score
	> into something that can be used for determining which Terms are more
	> relevant for a given Search Request.
	> One way would be to just assume that - higher the score, more
	> relveant is the search. But is this assumption really valid? Or are
	> there any possible caveats to this?
	> -Rishabh
	> _____________________________________________________________
	> Get 25MB, POP3, Spam Filtering with LYCOS MAIL PLUS for $19.95/year.
	> --
	> To unsubscribe, e-mail: 
	> <>
	> For additional commands, e-mail:
	> <>
	Do you Yahoo!?
	Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
	To unsubscribe, e-mail:   <>
	For additional commands, e-mail: <>

View raw message