lucene-java-user mailing list archives

From Mark Harwood <>
Subject Re: Question regarding sorting and memory consumption in lucene
Date Wed, 15 Oct 2008 06:39:35 GMT
Yes, StringIndex's public fields make life awkward.  Re initialization - I did think you could
try using arrays of byte arrays. The first 256 terms can be addressed using just one byte array;
on encountering a 257th term an extra byte array is allocated. References to terms then require
indexing into 2 byte arrays and bit shifting the 2nd byte to produce a combined short which
can address up to 65k terms held in a term pool.
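
A minimal Java sketch of that layout, purely as an illustration: BytePlaneOrdinals and its
methods are hypothetical names, not Lucene classes, and it assumes term ordinals start at 0
and that every document's ordinal is set as terms are enumerated. Each "plane" is one byte
array sized to the number of documents, and a further plane is allocated only when an ordinal
no longer fits in the existing ones.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: per-document term ordinals stored as byte "planes". */
class BytePlaneOrdinals {
    private final int maxDoc;
    // planes.get(0) holds the least significant byte of every ordinal.
    private final List<byte[]> planes = new ArrayList<>();

    BytePlaneOrdinals(int maxDoc) {
        this.maxDoc = maxDoc;
        planes.add(new byte[maxDoc]);
    }

    /** Stores an ordinal, allocating a further plane only when it is needed. */
    void set(int doc, int ord) {
        if (ord < 0) {
            throw new IllegalArgumentException("ord must be non-negative");
        }
        // Grow until the ordinal fits; earlier entries keep a zero high byte.
        while (planes.size() < 4 && (ord >>> (8 * planes.size())) != 0) {
            planes.add(new byte[maxDoc]);
        }
        for (int p = 0; p < planes.size(); p++) {
            planes.get(p)[doc] = (byte) (ord >>> (8 * p));
        }
    }

    /** Rebuilds the ordinal by shifting each plane's byte into position. */
    int get(int doc) {
        int ord = 0;
        for (int p = 0; p < planes.size(); p++) {
            ord |= (planes.get(p)[doc] & 0xFF) << (8 * p);
        }
        return ord;
    }

    int planeCount() {
        return planes.size();
    }
}
```

With this sketch, ordinals 0 to 255 need a single plane, and the 257th term (ordinal 256)
triggers allocation of the second plane without copying the first.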

When sorting, a fast comparison of 2 values can avoid always indexing into all byte arrays
and shifting to produce a number. Simply comparing entries from the most significant byte
array first can reveal a difference in order; if they are equal, the bytes from the next
most significant byte array are compared, and so on.
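
The comparison could look roughly like the following Java sketch. PlaneCompare is a
hypothetical helper, and the byte[][] layout (planes[0] holding the least significant byte
of each document's ordinal) is an assumption made for the example. Bytes are masked with
0xFF because Java bytes are signed.

```java
/** Hypothetical sketch: compare two documents' ordinals plane by plane. */
class PlaneCompare {
    /**
     * Assumes planes[0] is the least significant byte plane.
     * Returns a negative, zero, or positive value, like Comparator.compare.
     */
    static int compare(byte[][] planes, int docA, int docB) {
        // Start at the most significant plane and stop at the first difference.
        for (int p = planes.length - 1; p >= 0; p--) {
            int a = planes[p][docA] & 0xFF; // treat the byte as unsigned
            int b = planes[p][docB] & 0xFF;
            if (a != b) {
                return a < b ? -1 : 1;
            }
        }
        return 0; // every plane matched, so the ordinals are equal
    }
}
```

When two ordinals differ in their high byte, only one array lookup per document is needed
and no shifting takes place.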

Not sure how this would perform compared to simply upgrading whole byte arrays to shorts to
ints as you go. 
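
For comparison, the "upgrade as you go" alternative might be sketched in Java as below.
GrowableOrdinals is a hypothetical name, not existing Lucene code. It starts with a byte
array and copies into a wider array the first time an ordinal does not fit, which is the
copy whose cost is in question.

```java
/** Hypothetical sketch: widen one ordinal array from byte to short to int. */
class GrowableOrdinals {
    // Exactly one of these arrays is non-null at any time.
    private byte[] bytes;
    private short[] shorts;
    private int[] ints;

    GrowableOrdinals(int maxDoc) {
        bytes = new byte[maxDoc];
    }

    void set(int doc, int ord) {
        if (ord < 0) {
            throw new IllegalArgumentException("ord must be non-negative");
        }
        if (bytes != null && ord > 0xFF) {
            // Copy every existing entry into a short array (unsigned widening).
            shorts = new short[bytes.length];
            for (int i = 0; i < bytes.length; i++) {
                shorts[i] = (short) (bytes[i] & 0xFF);
            }
            bytes = null;
        }
        if (shorts != null && ord > 0xFFFF) {
            // Copy every existing entry into an int array.
            ints = new int[shorts.length];
            for (int i = 0; i < shorts.length; i++) {
                ints[i] = shorts[i] & 0xFFFF;
            }
            shorts = null;
        }
        if (bytes != null) {
            bytes[doc] = (byte) ord;
        } else if (shorts != null) {
            shorts[doc] = (short) ord;
        } else {
            ints[doc] = ord;
        }
    }

    int get(int doc) {
        if (bytes != null) {
            return bytes[doc] & 0xFF;
        }
        if (shorts != null) {
            return shorts[doc] & 0xFFFF;
        }
        return ints[doc];
    }

    int bytesPerOrd() {
        return bytes != null ? 1 : (shorts != null ? 2 : 4);
    }
}
```

Lookups here touch a single array, at the price of a full copy on each upgrade; the byte
plane approach avoids the copy but reads several arrays per lookup.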


On 15 Oct 2008, at 00:56, Chris Hostetter <> wrote:

: Actually looking at this a little deeper maybe Lucene could/should 
: automatically be doing this "short" optimisation here?

At the moment it can't; the arrays in StringIndex are public.

The other thing that would be a bit tricky is the initialization ... I 
can't think of any easy way to know in advance how many terms there are 
before iterating over all the terms, so you'd have to assume one and then 
if you're wrong copy to the other -- not sure how expensive that copy 
would be.

It's a little more feasible for custom clients to do when they know in 
advance how many terms they've got -- but some of the existing 
FieldCacheImpl code could probably be refactored to make it easier on 


To unsubscribe, e-mail:
For additional commands, e-mail:

