lucene-dev mailing list archives

From Jonathan Baxter <>
Subject Re: Lucene's use of one byte to encode document length
Date Tue, 14 Jan 2003 22:10:25 GMT
I didn't realise document-length precision was that unimportant for
ranking. What does Google do? If they pull one byte per document into
memory then, at least going by their claimed document count, that's
over 3 GB. I can't see them equipping their 10,000 Linux machines with
more than 3 GB of memory each.

Apologies if this is off-topic for this list.



On Wednesday 15 January 2003 04:21, Doug Cutting wrote:
> Jonathan Baxter wrote:
> > How important is it for I/O performance that Lucene uses only one
> > byte to represent document length? Or are there reasons other
> > than performance for using so few bits?
> To achieve good search performance, field-length normalization
> factors must be memory-resident.  So not only must the entire
> contents of these files be read when searching, they must also be
> kept in memory.  With the one-byte encoding this means that Lucene
> requires a byte per indexed field per document.  So a 10M-document
> collection with five fields requires 50 MB of memory to be searched.
> Doubling the norms to two bytes would double this memory requirement.
> Is that acceptable?  It depends on who you ask.
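[For concreteness, the arithmetic behind the figures above can be sketched as follows; the document and field counts are the assumed values from the message, not measurements.]

```python
# Back-of-envelope norm-memory estimate: one norm byte per indexed
# field per document must stay resident in RAM while searching.
def norm_bytes(num_docs: int, num_fields: int, bytes_per_norm: int = 1) -> int:
    """Memory needed to hold all field-length norms in RAM."""
    return num_docs * num_fields * bytes_per_norm

# 10M documents, five indexed fields:
print(norm_bytes(10_000_000, 5))      # 50000000 bytes, i.e. 50 MB
print(norm_bytes(10_000_000, 5, 2))   # 100000000 bytes if norms double to 2 bytes
```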
> Why do you find this insufficient?  The one byte float format (used
> in the current, unreleased sources) can actually represent a large
> range of values.  Its precision is low, but high precision isn't
> usually required for length normalization or Google-style boosting.
> Are you trying to use this for some other purpose in your ranking?
> Doug
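[Editor's note: the range-versus-precision trade-off Doug describes can be illustrated with a sketch of a one-byte float along these lines: 3 mantissa bits and 5 exponent bits, no sign bit since norms are non-negative. The constants below are illustrative, modelled on how Lucene's norm encoding is commonly described, not copied from the Lucene sources.]

```python
import struct

def float_to_byte(f: float) -> int:
    """Encode a non-negative float into one byte (3-bit mantissa, 5-bit exponent)."""
    if f <= 0.0:
        return 0
    # Take the IEEE-754 bits of f, then keep the exponent plus the
    # top 3 mantissa bits.
    bits = struct.unpack('>I', struct.pack('>f', f))[0]
    small = bits >> 21
    zero_point = (63 - 15) << 3        # re-bias the 8-bit exponent to 5 bits
    if small <= zero_point:            # underflow: round tiny positives up
        return 1
    if small >= zero_point + 0x100:    # overflow: clamp to the largest code
        return 0xFF
    return small - zero_point

def byte_to_float(b: int) -> float:
    """Decode a one-byte code back to a float."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack('>f', struct.pack('>I', bits))[0]

# Values with at most 3 significant mantissa bits round-trip exactly;
# everything else rounds down to the nearest representable value.
print(byte_to_float(float_to_byte(1.0)))   # 1.0
print(byte_to_float(float_to_byte(0.1)))   # 0.09375
```

With these (assumed) constants, the 256 codes span a range of roughly 10^-9 to 10^9, which shows Doug's point: a single byte covers an enormous dynamic range, and only fine-grained precision is sacrificed.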
