lucene-dev mailing list archives

From Dmitry Serebrennikov <>
Subject Re: Reading terms performance
Date Thu, 05 Sep 2002 19:56:48 GMT
Martin Sevigny wrote:

>Lucene developers,
>If an application using Lucene wants to read the list of values for a
>field, it must use (I think) the IndexReader.terms() method. But this
>method is costly, because it returns all values for all fields, although
>we could want only the values of a field.
If you use the method IndexReader.terms(Term startAt), the enumeration
will start with the first term equal to or greater than the one supplied.
Terms are ordered by field, then text, so all terms of a given field are
contiguous. If you create your initial term with the field you are
interested in and "" for the text, the enumeration will start with the
first term of that field. Now just go through the enum calling next()
until the returned term has a field other than the one you are
interested in. Field names are interned (see String.intern()), so they
can be compared with == instead of .equals(). This speeds things up a lot.
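
To make the seek-and-scan idea concrete, here is a minimal sketch that
does not use Lucene at all. It models the term dictionary as a
java.util.TreeSet of (field, text) pairs. The names FieldTermScan, Term
and termsOfField are made up for illustration and are not Lucene's API.
The sketch compares field names with equals() because its strings are
not guaranteed to be interned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class FieldTermScan {
    /** Stand-in for an index term: ordered by field first, then by text. */
    static final class Term implements Comparable<Term> {
        final String field;
        final String text;

        Term(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(Term other) {
            int byField = field.compareTo(other.field);
            return byField != 0 ? byField : text.compareTo(other.text);
        }
    }

    /** Small sample dictionary with terms from three fields. */
    static TreeSet<Term> sample() {
        TreeSet<Term> dict = new TreeSet<>();
        dict.add(new Term("title", "search"));
        dict.add(new Term("author", "smith"));
        dict.add(new Term("year", "2002"));
        dict.add(new Term("title", "lucene"));
        return dict;
    }

    /** Collects the texts of all terms of one field, in dictionary order. */
    static List<String> termsOfField(TreeSet<Term> dictionary, String field) {
        List<String> texts = new ArrayList<>();
        // Seek: (field, "") sorts at or before every term of that field.
        for (Term t : dictionary.tailSet(new Term(field, ""), true)) {
            if (!t.field.equals(field)) {
                break; // reached the first term of the next field
            }
            texts.add(t.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(termsOfField(sample(), "title"));
    }
}
```

The scan touches only the terms of the requested field plus one term
of the following field, which is what makes the start-term approach
cheaper than walking the whole dictionary.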

TermEnums are efficient in that they skip into the term enumeration 
quickly (using an in-memory index of all terms in a given segment, which 
are stored on disk). Also, the TermEnum will read ahead as appropriate 
so that you don't read (much) more than you have to.

Finally, .terms(Term) differs from .terms() in one tricky way that can
bite you if you are not careful. The TermEnum returned from the .terms()
method is positioned *before* the first term, so next() must be called
before you can use the enumeration.

However, the TermEnum returned from the terms(Term) method is positioned
*at* the starting term (the first term greater than or equal to the one
supplied). That means you should process the current term first and
only then call next().
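
The two loop shapes that follow from this can be sketched with a toy
cursor over a sorted list. This is only an illustration: Cursor,
beforeFirst, seek, readAll and readFrom are hypothetical names, and the
class merely imitates the two starting positions described above.

```java
import java.util.ArrayList;
import java.util.List;

class TermEnumPositioning {
    /** Toy cursor over a sorted list of term texts. */
    static final class Cursor {
        private final List<String> terms;
        private int pos;

        private Cursor(List<String> terms, int pos) {
            this.terms = terms;
            this.pos = pos;
        }

        /** Imitates terms(): positioned before the first term. */
        static Cursor beforeFirst(List<String> sortedTerms) {
            return new Cursor(sortedTerms, -1);
        }

        /** Imitates terms(Term): positioned at the first term >= start. */
        static Cursor seek(List<String> sortedTerms, String start) {
            int i = 0;
            while (i < sortedTerms.size() && sortedTerms.get(i).compareTo(start) < 0) {
                i++;
            }
            return new Cursor(sortedTerms, i);
        }

        /** Advances by one term; returns false once the end is reached. */
        boolean next() {
            if (pos < terms.size()) {
                pos++;
            }
            return pos < terms.size();
        }

        /** Current term, or null when not positioned on a term. */
        String term() {
            return (pos >= 0 && pos < terms.size()) ? terms.get(pos) : null;
        }
    }

    /** Unseeked cursor: call next() first, then read. */
    static List<String> readAll(List<String> sortedTerms) {
        List<String> out = new ArrayList<>();
        Cursor c = Cursor.beforeFirst(sortedTerms);
        while (c.next()) {
            out.add(c.term());
        }
        return out;
    }

    /** Seeked cursor: read the current term first, then call next(). */
    static List<String> readFrom(List<String> sortedTerms, String start) {
        List<String> out = new ArrayList<>();
        Cursor c = Cursor.seek(sortedTerms, start);
        if (c.term() == null) {
            return out; // the seek ran past the last term
        }
        do {
            out.add(c.term());
        } while (c.next());
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = java.util.Arrays.asList("a", "c", "d");
        System.out.println(readAll(terms));
        System.out.println(readFrom(terms, "b"));
    }
}
```

Using the while (c.next()) shape on a seeked cursor would silently skip
the first matching term, which is the mistake warned about above.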

>Are there any tricks here to increase performance? Are there any plans?
>For instance, all field values are stored in a single file for a segment
>(.tis). Maybe splitting the values into a specific file per field would
>make it work better?
>The other thing I was wondering is the sorting of these terms. They are
>retrieved in the order according to Java's compareTo() method. It means
>that they are sometimes in alphabetical order (English or English-like
>languages), but not always. Is this ordering really significant in the
>internals of Lucene? Or is it just there for convenience to the
>application developer?
Yes, it is significant for searching. However, if you do not run queries
against a given field, but just want to use it as a dictionary, the
terms can take any form, for example "sortprefix:value", so that they
sort correctly and yet the actual values can still be extracted.
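
One possible way to build such a sort prefix for non-English values is
sketched below. It assumes the prefix is a hex-encoded
java.text.CollationKey and that the field is used only as a dictionary,
never queried. SortPrefixTerms, encode, decode and sortedValues are
hypothetical names. The sketch uses '#' rather than ':' as the
separator, because '#' sorts below every hex digit, so a key that
happens to be a prefix of another key still sorts first.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

class SortPrefixTerms {
    // Assumed separator: sorts below '0'-'9' and 'a'-'f'.
    private static final char SEPARATOR = '#';

    /** Builds "hexCollationKey#value" so plain String order matches locale order. */
    static String encode(Collator collator, String value) {
        byte[] key = collator.getCollationKey(value).toByteArray();
        StringBuilder sb = new StringBuilder(key.length * 2 + 1 + value.length());
        for (byte b : key) {
            // Two lowercase hex digits per byte keep byte order under compareTo().
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.append(SEPARATOR).append(value).toString();
    }

    /** Recovers the original value: everything after the first separator. */
    static String decode(String termText) {
        return termText.substring(termText.indexOf(SEPARATOR) + 1);
    }

    /** Stores encoded terms in compareTo() order and reads the values back. */
    static List<String> sortedValues(Locale locale, List<String> values) {
        Collator collator = Collator.getInstance(locale);
        TreeSet<String> dictionary = new TreeSet<>(); // natural String order
        for (String v : values) {
            dictionary.add(encode(collator, v));
        }
        List<String> out = new ArrayList<>();
        for (String term : dictionary) {
            out.add(decode(term));
        }
        return out;
    }

    public static void main(String[] args) {
        // "\u00e9cole" is "ecole" with an acute accent on the first letter.
        List<String> values = Arrays.asList("zebra", "\u00e9cole", "eau");
        System.out.println(sortedValues(Locale.FRENCH, values));
    }
}
```

The trade-off is larger terms on disk, since every value carries its
collation key, in exchange for not having to re-sort after reading.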

>I'm asking because we have an application that makes lots of use of these
>lists of terms, for non-English values, and performance in reading the
>values and resorting them is a problem right now.
If you are doing things the way described above, I don't know of any 
other ways to up the speed. You may want to store the terms in a 
different way, where they would be compressed and take up less space on 
disk, thereby causing less disk IO. Perhaps you can use Lucene to 
extract and alphabetize the terms, and then transfer them into another 
file for faster access. How large is your document base? How many unique
terms in the target field? In my experience term access is quite fast...

Also, check the IO speed to the disk you are storing this on. In my 
experience, a slow disk or a slower bus to that disk can slow things 
down by as much as 10 to 20 times!

Good luck.

>Thanks for any clues,
>Martin Sévigny
