lucene-java-user mailing list archives

From Rebecca Watson <>
Subject Re: Why not normalization?
Date Thu, 08 Jul 2010 04:04:27 GMT

> 1) Although Lucene uses tf to calculate scoring it seems to me that term
> frequency has not been normalized. Even if I index several documents, it
> does not normalize tf value. Therefore, since the total number of words
> in index documents are varied, can't there be a fault in Lucene's scoring?

tf = term frequency, i.e. the number of times the term appears in the document,
while idf = inverse document frequency, a measure of how rare a term is,
i.e. related to how many documents the term appears in.

if term1 occurs more frequently in a document, i.e. tf is higher, you want to
weight the document higher when you search for term1.

but if term1 is a very frequent term, i.e. it appears in lots of documents, then
it's probably not as important to an overall search (where we have term1, term2
etc.), so you want to downweight it (this is where idf comes in).

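then normalisations like length normalisation (which allows for 'fair' scoring
across fields of varying length) come in too. this is the part that compensates
for documents containing different total numbers of words.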
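
as a rough sketch of how these three pieces fit together, here is some plain
java that mirrors the formulas in Lucene 3.x's DefaultSimilarity (sqrt for tf,
1 + ln(numDocs / (docFreq + 1)) for idf, 1 / sqrt(numTerms) for length
normalisation). the class and method names are made up for illustration and
are not Lucene API, and the score is simplified to a single term with no
boosts, coord or queryNorm:

```java
// Illustrative sketch only: mirrors the default formulas, not Lucene's real classes.
class TfIdfSketch {
    // tf: square root of the raw count, so repeated occurrences help
    // but with diminishing returns.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // idf: the fewer documents contain the term, the higher the weight.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Length normalisation: a field with fewer terms gets a larger factor,
    // so long documents do not win just by being long.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Simplified single-term score: tf * idf^2 * lengthNorm.
    // (idf is applied once on the query side and once on the document side.)
    static double score(int freq, int docFreq, int numDocs, int numTerms) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Same term frequency, same term rarity, different field lengths.
        System.out.println("short field: " + score(2, 10, 1000, 50));
        System.out.println("long field:  " + score(2, 10, 1000, 5000));
    }
}
```

so the raw tf itself is not divided by document length, but the lengthNorm
factor does that job: with the same tf, the shorter field ends up with the
higher score.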

the tf-idf scoring formula used by lucene is a scoring method that's been around
a long long time... there are competing scoring metrics but that's an IR thing
and not an argument you want to start on the lucene lists! :)

these are IR ('information retrieval') concepts and you might want to start by
reading through some explanations of tf-idf scoring.

> 2) What is the formula to calculate this fieldNorm value?

in terms of how lucene implements its tf-idf scoring - you can see here:
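
separately from that link, a minimal sketch of the fieldNorm calculation,
assuming the Lucene 3.x DefaultSimilarity behaviour: the norm is the document
boost times the field boost times lengthNorm, where lengthNorm is
1 / sqrt(number of terms in the field). the names below are hypothetical
helpers, not Lucene methods:

```java
// Illustrative sketch only: shows the arithmetic, not Lucene's own code.
class FieldNormSketch {
    // Default length normalisation: 1 / sqrt(number of terms in the field).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // fieldNorm combines the index-time boosts with the length normalisation.
    static float fieldNorm(float docBoost, float fieldBoost, int numTerms) {
        return docBoost * fieldBoost * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        System.out.println(fieldNorm(1.0f, 1.0f, 4));   // 4-term field, no boosts
        System.out.println(fieldNorm(1.0f, 1.0f, 16));  // 16-term field, no boosts
        System.out.println(fieldNorm(1.0f, 2.0f, 16));  // 16-term field, field boost 2
    }
}
```

note that lucene stores each norm in a single byte, so the fieldNorm you see
in an explain() output is a coarsely rounded version of this value and will
not always match the exact arithmetic.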

also, the lucene in action book is a really good book if you are starting out
with lucene (and will save you a lot of grief with understanding lucene /
setting up your application!). it covers all the basics and then moves on to
more advanced stuff and has lots of code examples too:

hope that helps,

bec :)
