lucene-dev mailing list archives

From Doug Cutting <>
Subject Re: Lucene's use of one byte to encode document length
Date Tue, 14 Jan 2003 17:51:45 GMT
Jonathan Baxter wrote:
> How important is it for I/O performance that Lucene uses only one byte 
> to represent document length? Or are there reasons other than 
> performance for using so few bits?

To achieve good search performance, field-length normalization factors 
must be memory-resident.  So not only must the entire contents of the 
files that store these factors be read when searching, they must also 
be kept in memory.  With the one-byte encoding this means that Lucene 
requires one byte per indexed field per document, so a 10M document 
collection with five indexed fields requires 50MB of memory just to be 
searched.  Doubling the encoding to two bytes would double that memory 
requirement.  Is that acceptable?  It depends on who you ask.
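
To make the arithmetic concrete, here is a rough sketch (the document 
and field counts are just the numbers from the example above, nothing 
Lucene-specific):

// Rough sketch of the norm memory cost: one byte per indexed field
// per document, all of it resident in memory while searching.
public class NormMemory {
  public static void main(String[] args) {
    long numDocs = 10000000L;   // 10M document collection
    int indexedFields = 5;      // five indexed fields

    long oneByteNorms = numDocs * indexedFields;      // ~50MB resident
    long twoByteNorms = numDocs * indexedFields * 2;  // ~100MB resident

    System.out.println("one-byte norms: " + oneByteNorms + " bytes");
    System.out.println("two-byte norms: " + twoByteNorms + " bytes");
  }
}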

Why do you find this insufficient?  The one-byte float format (used in 
the current, unreleased sources) can actually represent a large range 
of values.  Its precision is low, but high precision isn't usually 
required for length normalization or Google-style boosting.
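
For illustration, here is one way a one-byte float with a 5-bit 
exponent and a 3-bit mantissa could be encoded and decoded.  The class 
name, the bias of 15, and the exact bit layout below are just 
assumptions for the sketch, not necessarily what the unreleased 
sources do:

// Illustrative only: one possible one-byte float with 5 exponent bits
// and 3 mantissa bits (no sign).  The bias of 15 and the bit layout
// are assumptions for this sketch; the unreleased sources may differ.
public class ByteFloat {
  private static final int EXP_BIAS = 15;

  // Encode a non-negative float into a single byte.
  public static byte floatToByte(float f) {
    if (f <= 0.0f) return 0;                   // zero (and negatives) -> 0
    int bits = Float.floatToIntBits(f);
    int exp = ((bits >> 23) & 0xff) - 127 + EXP_BIAS;  // rebias exponent
    int mantissa = (bits >> 20) & 0x07;        // keep top 3 mantissa bits
    if (exp >= 32) return (byte) 0xff;         // overflow: largest value
    if (exp < 0) return 1;                     // underflow: smallest value
    byte b = (byte) ((exp << 3) | mantissa);
    return b == 0 ? (byte) 1 : b;              // keep 0 reserved for zero
  }

  // Decode the byte back to an approximate float.
  public static float byteToFloat(byte b) {
    if (b == 0) return 0.0f;
    int mantissa = b & 0x07;
    int exp = ((b >> 3) & 0x1f) - EXP_BIAS + 127;
    return Float.intBitsToFloat((exp << 23) | (mantissa << 20));
  }

  public static void main(String[] args) {
    float[] samples = {0.0f, 0.001f, 0.125f, 1.0f, 7.5f, 300.0f};
    for (float f : samples) {
      byte b = floatToByte(f);
      System.out.println(f + " -> " + b + " -> " + byteToFloat(b));
    }
  }
}

With only three mantissa bits, adjacent representable values in this 
sketch differ by roughly 12%, which is the low precision mentioned 
above, while the representable range still spans about ten orders of 
magnitude.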

Are you trying to use this for some other purpose in your ranking?

