nutch-dev mailing list archives

From: Doug Cutting
Subject: Re: IndexOptimizer (Re: Lucene performance bottlenecks)
Date: Tue, 13 Dec 2005 11:39:57 GMT

Andrzej Bialecki wrote:
> Shouldn't this be combined with a HitCollector that collects only the 
> first-n matches? Otherwise we still need to scan the whole posting list...

Yes.  I was just posting the work-in-progress.
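
To make the first-n idea concrete, here is a minimal sketch, assuming postings are visited in increasing doc id order. The `FirstNCollector` class and its boolean-returning `collect` method are hypothetical names made up for illustration; they are not Lucene's `HitCollector` API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical collector that keeps only the first n matches (not a Lucene class). */
public class FirstNCollector {
    private final int limit;
    private final List<Integer> docs = new ArrayList<>();

    public FirstNCollector(int limit) {
        this.limit = limit;
    }

    /** Records a hit; returns false once the limit is reached so the caller can stop. */
    public boolean collect(int doc) {
        if (docs.size() < limit) {
            docs.add(doc);
        }
        return docs.size() < limit;
    }

    public List<Integer> docs() {
        return docs;
    }

    /** Walks a posting list in order, stopping early; returns how many postings were read. */
    public static int scan(int[] postings, FirstNCollector collector) {
        if (collector.limit <= 0) {
            return 0; // nothing requested, so nothing is read
        }
        int read = 0;
        for (int doc : postings) {
            read++;
            if (!collector.collect(doc)) {
                break; // enough hits: the rest of the posting list is never touched
            }
        }
        return read;
    }
}
```

With a limit of 3 and the postings 2, 5, 9, 14, 20, the scan reads three postings and stops, so the cost depends on the limit rather than on the length of the posting list.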

We will also need to estimate the total number of matches by 
extrapolating linearly from the maximum doc id processed.  Finally, the 
IndexOptimizer is probably rather slow for large indexes, whose .fdt file 
won't fit in memory.  A simple way to improve that might be to use 
Similarity.floatToByte-encoded floats when sorting (e.g., the norm from 
an untokenized field) so that documents whose boosts are close are not 
re-ordered.  I'll start work on these in the morning.  (It is currently 
my middle-of-night.)
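
The two ideas above can be sketched roughly as follows. This is only an illustration under stated assumptions: `SortedIndexSketch`, `estimateTotalHits`, `coarseKey`, and `sortByCoarseBoost` are hypothetical names, doc ids are assumed to be zero-based, boosts are assumed to be non-negative, and `coarseKey` is a simple stand-in that keeps a float's exponent and top two mantissa bits. It is not Lucene's actual byte encoding.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical helpers, for illustration only (not part of Lucene or Nutch). */
public class SortedIndexSketch {

    /**
     * Linear extrapolation: if hitsSeen matches were found among doc ids
     * 0..maxDocSeen, assume the same density over all maxDoc documents.
     */
    public static long estimateTotalHits(int hitsSeen, int maxDocSeen, int maxDoc) {
        if (hitsSeen == 0) {
            return 0;
        }
        // Doc ids are assumed zero-based, so maxDocSeen + 1 documents were covered.
        return Math.round((double) hitsSeen * maxDoc / (maxDocSeen + 1));
    }

    /**
     * Stand-in for a lossy float encoding: drops the low 21 bits, keeping the
     * exponent and two mantissa bits. Monotonic for non-negative floats.
     */
    public static int coarseKey(float boost) {
        return Float.floatToIntBits(boost) >>> 21;
    }

    /**
     * Returns doc ids ordered by descending coarse boost. Arrays.sort on objects
     * is stable, so docs whose boosts share a coarse key keep their original order.
     */
    public static Integer[] sortByCoarseBoost(float[] boosts) {
        Integer[] docs = new Integer[boosts.length];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = i;
        }
        Comparator<Integer> byKey = Comparator.comparingInt((Integer d) -> coarseKey(boosts[d]));
        Arrays.sort(docs, byKey.reversed());
        return docs;
    }
}
```

For example, with boosts 1.0, 2.0, 1.01, and 0.5 for docs 0 to 3, an exact sort would move doc 2 ahead of doc 0, while the coarse sort treats 1.0 and 1.01 as equal and leaves docs 0 and 2 in their original order.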

