nutch-dev mailing list archives

From Doug Cutting
Subject Re: Lucene performance bottlenecks
Date Thu, 08 Dec 2005 17:59:25 GMT
Doug Cutting wrote:
> Implementing something like this for Lucene would not be too difficult. 
> The index would need to be re-sorted by document boost: documents would 
> be re-numbered so that highly-boosted documents had low document 
> numbers.

In particular, one could:

1. Create an array of int[maxDoc], with a[i] = i.
2. Sort the array with order(i,j) = boost(j) - boost(i), so that
   highly-boosted documents come first.
3. Implement a FilterIndexReader that re-numbers using the sorted array.
   So, for example, the document numbers in the TermPositions will be
   a[old.doc()].  Each term's positions will need to be read entirely
   into memory and sorted to perform this renumbering.
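
A rough sketch of these three steps, outside of Lucene, might look like
the following.  The class and method names are made up for illustration
and are not Lucene API.  It assumes the boosts are available as a plain
float[] indexed by old document number, that the highest boost should
sort first, and that the old-to-new map used for renumbering is the
inverse of the sorted array.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: builds the renumbering from a
// plain array of per-document boosts.
final class BoostRenumbering {

    /** newToOld[n] is the old number of the document that receives new number n. */
    static int[] sortByBoostDescending(float[] boost) {
        Integer[] order = new Integer[boost.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;                       // step 1: a[i] = i
        }
        // Step 2: stable sort, highest boost first (ties keep their old order).
        Arrays.sort(order, (x, y) -> Float.compare(boost[y], boost[x]));
        int[] newToOld = new int[order.length];
        for (int n = 0; n < order.length; n++) {
            newToOld[n] = order[n];
        }
        return newToOld;
    }

    /** Inverts the permutation: oldToNew[d] is the new number of old document d. */
    static int[] invert(int[] newToOld) {
        int[] oldToNew = new int[newToOld.length];
        for (int n = 0; n < newToOld.length; n++) {
            oldToNew[newToOld[n]] = n;
        }
        return oldToNew;
    }

    /** Step 3 for one term: map each posting, then re-sort into increasing doc order. */
    static int[] renumberPostings(int[] oldDocs, int[] oldToNew) {
        int[] newDocs = new int[oldDocs.length];
        for (int k = 0; k < oldDocs.length; k++) {
            newDocs[k] = oldToNew[oldDocs[k]];
        }
        Arrays.sort(newDocs);                   // the whole list is held in memory here
        return newDocs;
    }
}
```

The final sort is why each term's postings have to be read entirely into
memory: the renumbered documents no longer arrive in increasing order.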

The class in the searcher package was an old attempt 
to create something like what Suel calls "fancy postings".  It creates 
an index with the top 10% scoring postings.  Since documents are not 
renumbered one can intermix postings from this with the full index.  So 
for example, one can first try searching using this index for terms that 
occur more than, e.g., 10k times, and use the full index for rarer 
words.  If that does not find 1000 hits then the full index must be 
searched.  Such an approach can be combined with using a pre-sorted index.
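
To make the intermixing concrete, here is a minimal sketch for a
two-term AND query.  Everything in it is hypothetical (the names, the
array-based postings, the thresholds passed as parameters); it only
illustrates that, because document numbers are shared, a pruned list for
a frequent term can be intersected with the full list for a rare term,
with a fallback to the full lists when too few hits come back.

```java
import java.util.Arrays;

// Hypothetical sketch; names and array-based postings are illustrative, not Lucene API.
final class TieredPostings {

    /** Uses the pruned ("top") postings for a frequent term, the full postings for a rare one. */
    static int[] choose(int[] topPostings, int[] fullPostings, int frequentThreshold) {
        return fullPostings.length > frequentThreshold ? topPostings : fullPostings;
    }

    /** Intersects two doc-number lists that are sorted in increasing order. */
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                out[n++] = a[i];
                i++;
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }

    /** Tries the tiered lists first; searches the full lists if fewer than 'wanted' hits are found. */
    static int[] search(int[] top1, int[] full1, int[] top2, int[] full2,
                        int frequentThreshold, int wanted) {
        int[] hits = intersect(choose(top1, full1, frequentThreshold),
                               choose(top2, full2, frequentThreshold));
        if (hits.length >= wanted) {
            return hits;
        }
        return intersect(full1, full2);         // not enough hits: fall back to the full index
    }
}
```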

I think the first thing to implement would be something like what Suel
calls first-1000.  Then we need to evaluate this and determine, for a
query log, how different the results are.
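
As a sketch of what that experiment could look like: assuming the index
is already sorted so that matching documents arrive in increasing
document number, first-k just stops after k matches, and one simple way
to compare results is the fraction of the full search's top hits that
the truncated search also returns.  The names below are hypothetical and
the overlap measure is only one possible choice, not something Suel or
Lucene prescribes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone sketch; none of these names are Lucene API.
final class FirstK {

    /** Collects at most k matching documents, scanning in increasing doc-number order. */
    static List<Integer> firstK(int[] matchingDocs, int k) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : matchingDocs) {
            if (hits.size() >= k) {
                break;                          // stop searching once k hits are found
            }
            hits.add(doc);
        }
        return hits;
    }

    /** Fraction of the reference hits that also appear in the truncated result. */
    static double overlap(List<Integer> truncated, List<Integer> reference) {
        if (reference.isEmpty()) {
            return 1.0;
        }
        Set<Integer> seen = new HashSet<>(truncated);
        int common = 0;
        for (int doc : reference) {
            if (seen.contains(doc)) {
                common++;
            }
        }
        return (double) common / reference.size();
    }
}
```

Averaging the overlap over every query in the log would give one number
for how much first-k changes the results.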

> Then a HitCollector can simply stop searching once a given 
> number of hits are found.
> Doug
