lucene-dev mailing list archives

From Yonik Seeley
Subject Re: constant scoring queries
Date Wed, 18 May 2005 21:06:48 GMT
> > > contains(docid) and exists(docid) cannot be efficiently implemented
> > > on a VInt based compact filter, so I'd prefer to leave these out.
> >
> > exists() on a BitSet is much faster than next() though...
> Yes, but the point is to iterate to the next document based on information
> from RAM and to be able to skipTo() on the index instead of reading it
> sequentially.

Well, yes, that's one purpose of filters (and probably the main use).
Another one that we rely on is fast intersection of two
filters you already have in memory.
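
For the first use (driving skipTo() on the index from an in-memory
filter), here is a minimal sketch of the leapfrog loop. DocSkipper and
ArraySkipper are hypothetical stand-ins for an index-side iterator, not
Lucene classes, and the Integer.MAX_VALUE sentinel is my own assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class SkipToSketch {
    /** Hypothetical index-side iterator; not a Lucene API. */
    interface DocSkipper {
        /** Returns the first doc id >= target, or Integer.MAX_VALUE when exhausted. */
        int skipTo(int target);
    }

    /** Toy implementation over a sorted array, standing in for the index. */
    static final class ArraySkipper implements DocSkipper {
        private final int[] docs;
        private int pos = 0;

        ArraySkipper(int[] sortedDocs) {
            this.docs = sortedDocs;
        }

        public int skipTo(int target) {
            while (pos < docs.length && docs[pos] < target) {
                pos++;
            }
            return pos < docs.length ? docs[pos] : Integer.MAX_VALUE;
        }
    }

    /** Walks the sorted in-memory filter and only asks the index for docs it could accept. */
    static List<Integer> filter(int[] sortedFilterDocs, DocSkipper index) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        while (i < sortedFilterDocs.length) {
            int doc = index.skipTo(sortedFilterDocs[i]);
            if (doc == Integer.MAX_VALUE) {
                break; // index exhausted
            }
            if (doc == sortedFilterDocs[i]) {
                hits.add(doc);
                i++;
            } else {
                // The index jumped past this filter doc; catch the filter up.
                while (i < sortedFilterDocs.length && sortedFilterDocs[i] < doc) {
                    i++;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] filterDocs = {2, 7, 40, 90};
        DocSkipper index = new ArraySkipper(new int[] {1, 2, 3, 7, 8, 50, 90, 100});
        System.out.println(filter(filterDocs, index));
    }
}
```

Neither side is read sequentially from start to finish here: each one
only advances to the next candidate offered by the other.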

> > I use a power-of-two hash table with a load factor of .75.  So putting
> > 500 docs in my hashset would take up 1024 slots at 4 bytes per slot
> > (4k).
> So about 8 bytes per doc. A SortedVIntList normally has 1 byte per doc,
> and never more than 4 bytes per doc (as long as doc numbers are int).

It also depends on your platform and the tradeoffs you want... our
production boxes all have 16G of RAM, so memory is less of a constraint
for us.
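
To put rough numbers on the two sizes quoted above, here is a sketch
under two assumptions: the hash table grows to the smallest power of two
that keeps the load factor at or below .75, and the compact list stores
the gaps between sorted doc numbers as VInts with 7 payload bits per
byte. The class and method names are made up for illustration.

```java
final class FilterMemorySketch {
    /** Bytes for a power-of-two table that holds numDocs within the given load factor. */
    static int hashTableBytes(int numDocs, double loadFactor, int bytesPerSlot) {
        int slots = 1;
        while (slots * loadFactor < numDocs) {
            slots <<= 1;
        }
        return slots * bytesPerSlot;
    }

    /** Bytes needed to write one non-negative int as a VInt (7 payload bits per byte). */
    static int vIntLength(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** Bytes for a sorted doc list stored as VInt-encoded gaps between neighbours. */
    static int vIntListBytes(int[] sortedDocs) {
        int total = 0;
        int previous = 0;
        for (int doc : sortedDocs) {
            total += vIntLength(doc - previous);
            previous = doc;
        }
        return total;
    }

    public static void main(String[] args) {
        // 500 docs, load factor .75, 4 bytes per slot: 1024 slots.
        System.out.println(hashTableBytes(500, 0.75, 4));
        // Gaps are 3, 7, 190 and 19800.
        System.out.println(vIntListBytes(new int[] {3, 10, 200, 20000}));
    }
}
```

The hash figure is fixed by the table size, while the VInt figure
depends on how far apart the doc numbers are.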

> > Hashes also have a big speed advantage over BitSets for iterating over
> > all docs or taking intersection sizes.  The hash also has fast random
> > access that docNrSkipper doesn't have.
> Can it determine the intersection size faster than iterating over
> both sets with an intersecting merge?

In many cases, yes.
Consider the example of taking the intersection size of a small set
with a large one.  You need fast random access (or a fast skip) on the
large set, so the cost tracks the size of the small set rather than the
sum of both.
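
A minimal sketch of that case, using java.util.BitSet as the
random-access large set (a hash set with a fast contains() would be
probed the same way). The class and method names are hypothetical, and
the small set is assumed to contain no duplicate doc numbers.

```java
import java.util.BitSet;

final class IntersectionSizeSketch {
    /** Counts the docs of the small set that are also in the large set: one probe per small doc. */
    static int intersectionSize(int[] smallDocs, BitSet large) {
        int count = 0;
        for (int doc : smallDocs) {
            if (large.get(doc)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        BitSet large = new BitSet();
        large.set(5);
        large.set(9);
        large.set(10);
        large.set(500);
        System.out.println(intersectionSize(new int[] {1, 5, 9, 1000}, large));
    }
}
```

An intersecting merge over two sorted lists has to step through both of
them, which is where the difference shows up when one side is much
larger.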

> > entire query to constant scoring (for instance when a non-score sort
> > is specified and the user doesn't care about the score).
> Sounds perfect to me, but I've never looked at the sorting code in depth.
> Do the non-score sort methods use HitCollector?
> That might be wasteful because it provides the score value for each doc.

Yes, but sometimes the score may still be desired even when sorting by
another field... so it should be configurable on a per-search basis or
per-query basis somehow.

