lucene-dev mailing list archives

From Christoph Goller <>
Subject Re: DocumentWriter.writeNorms : the way to compute the normalisation factor
Date Thu, 15 Apr 2004 13:34:44 GMT
Phil brunet wrote:
> Hi to all.
> In the DocumentWriter.writeNorms(Document doc, String segment) method
> (Lucene V1.3), I wonder if there is a special reason to compute the
> normalisation factor based on the number of tokens contained in the
> document (using the fieldLengths array) instead of computing it from the
> number of positions (the fieldPositions array).
> I think in most cases the difference is not significant, so using
> fieldLengths or fieldPositions would be equivalent. But I would like
> to be sure of it.
> So, if anybody has an opinion ...
> Thanks
> Phil
> Nota bene:
> =======
> If I understood correctly, the fieldLengths value and the fieldPositions
> value differ for a given document if and only if the document
> contains at least one token with a position increment of 0.
> In my case, such a token should not be counted in the normalisation
> factor, because I need this factor to be exactly in inverse proportion to
> the number OF DIFFERENT tokens (i.e. ignoring those with increment set
> to 0).

This issue was discussed a couple of weeks ago. It seems that some folks use
rather big position increments in order to mark sentence and paragraph
boundaries. Note that positions are currently used only by PhraseQueries, and
we do not want a PhraseQuery to match across the gap between sentences or
paragraphs. However, this means that the number of positions and the
number of tokens may differ considerably.
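To make the difference concrete, here is a small self-contained sketch (not Lucene source; the class and method names are illustrative). It assumes Lucene 1.3's default length normalisation of 1/sqrt(numTerms), and replays a stream of position increments the way a TokenStream would deliver them: an increment of 0 stacks a token on the previous position, so the token count and the position count diverge.

```java
// Illustrative sketch, not Lucene code: compares a length norm computed
// from the token count (what DocumentWriter.writeNorms does via the
// fieldLengths array) with one computed from the number of positions
// (last position + 1, as fieldPositions would give).
public class NormSketch {

    // Lucene 1.3's default length normalisation: 1 / sqrt(numTerms).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Replay a stream of position increments and return
    // { tokenCount, positionCount }.
    static int[] count(int[] increments) {
        int position = -1;
        for (int inc : increments) {
            position += inc;  // inc == 0 stacks a token on the same position
        }
        return new int[] { increments.length, position + 1 };
    }

    public static void main(String[] args) {
        // Five tokens; the middle one is a synonym with increment 0.
        int[] withSynonym = { 1, 1, 0, 1, 1 };
        int[] c = count(withSynonym);
        System.out.println("tokens=" + c[0] + " positions=" + c[1]);
        System.out.println("norm(tokens)=" + lengthNorm(c[0])
                + " norm(positions)=" + lengthNorm(c[1]));
    }
}
```

For five tokens where one has increment 0, the token-based norm is 1/sqrt(5) while the position-based norm is 1/sqrt(4), which is exactly the discrepancy Phil describes; a large sentence-boundary increment would skew the two counts in the opposite direction.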

Maybe you can solve your problem with the new IndexReader.setNorm.
Unfortunately, this means that you have to stop indexing, close your
writer, and open an IndexReader.
Not very comfortable.

