nutch-dev mailing list archives

From Doug Cutting <cutt...@nutch.org>
Subject Re: IndexOptimizer (Re: Lucene performance bottlenecks)
Date Wed, 14 Dec 2005 17:49:50 GMT
Andrzej Bialecki wrote:
> I'll test it soon - one comment, though. Currently you use a subclass of 
> RuntimeException to stop the collecting. I think we should come up with 
> a better mechanism - throwing exceptions is too costly.

I thought about this, but I could not see a simple way to achieve it. 
And one exception thrown per query is not very expensive.  But it is bad 
style.  Sigh.
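
For illustration, a minimal sketch of the exception-based stop discussed above. It assumes hypothetical stand-in types (Collector, LimitedCollector, LimitExceeded, scoreAll); none of these are Lucene classes or the actual patch. It is only meant to show that a single throw per query unwinds the scoring loop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; not Lucene's API.
class EarlyStopByException {

    /** Unchecked exception used purely as a "stop collecting" signal. */
    static class LimitExceeded extends RuntimeException {
        LimitExceeded() { super("hit limit reached"); }
    }

    /** Callback invoked once per matching document. */
    interface Collector {
        void collect(int doc, float score);
    }

    /** Collects doc ids and throws once the limit has been reached. */
    static class LimitedCollector implements Collector {
        final int maxTotalHits;
        final List<Integer> docs = new ArrayList<>();

        LimitedCollector(int maxTotalHits) { this.maxTotalHits = maxTotalHits; }

        @Override
        public void collect(int doc, float score) {
            if (docs.size() >= maxTotalHits) {
                throw new LimitExceeded();   // thrown at most once per query
            }
            docs.add(doc);
        }
    }

    /** Simulates a scorer that feeds every matching doc to the collector. */
    static void scoreAll(int numDocs, Collector collector) {
        for (int doc = 0; doc < numDocs; doc++) {
            collector.collect(doc, 1.0f);
        }
    }

    /** Runs a "query" and swallows the stop signal. */
    static List<Integer> search(int numDocs, int maxTotalHits) {
        LimitedCollector collector = new LimitedCollector(maxTotalHits);
        try {
            scoreAll(numDocs, collector);
        } catch (LimitExceeded e) {
            // Expected: the collector asked the scorer to stop early.
        }
        return collector.docs;
    }

    public static void main(String[] args) {
        System.out.println(search(1000, 3));   // prints [0, 1, 2]
    }
}
```

The scoring loop itself knows nothing about the limit, which is why this works without touching any interface, and also why it reads as bad style.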

> Perhaps the 
> HitCollector.collect() method should return a boolean to signal whether 
> the searcher should continue working.

We don't really want a HitCollector in this case: we want a TopDocs.  So 
the patch I made is required: we need to extend the HitCollector that 
implements TopDocs-based searching.

Long-term, to avoid the 'throw', we'd need to also:

1. Change:
      TopDocs Searchable.search(Query, Filter, int numHits)
    to:
      TopDocs Searchable.search(Query, Filter, int numHits, int maxTotalHits)

2. Add, for back-compatibility:
      TopDocs Searcher.search(Query query, Filter filter, int numHits) {
        return search(query, filter, numHits, Integer.MAX_VALUE);
      }

3. Add a new method:
      /** Return false to stop hit processing. */
      boolean HitCollector.processHit(int doc, float score) {
        collect(doc, score);   // for back-compatibility
        return true;
      }
    Then change all calls to HitCollector.collect to instead call this,
    and deprecate HitCollector.collect.
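
Step 3 can be sketched roughly as follows. This assumes hypothetical stand-in classes (SketchCollector, LegacyCollector, LimitedCollector, scoreAll) rather than the real Lucene types, and a limit of at least 1; it only illustrates how a boolean return lets the scoring loop stop without a throw while existing collect()-only collectors keep working.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; not Lucene's API.
class EarlyStopByReturnValue {

    /** Collector base class modelled on the proposal above. */
    static abstract class SketchCollector {
        /** The existing callback, kept for back-compatibility. */
        abstract void collect(int doc, float score);

        /** Return false to stop hit processing. */
        boolean processHit(int doc, float score) {
            collect(doc, score);   // default: delegate and never stop
            return true;
        }
    }

    /** An "old" collector that only implements collect(). */
    static class LegacyCollector extends SketchCollector {
        final List<Integer> docs = new ArrayList<>();

        @Override
        void collect(int doc, float score) { docs.add(doc); }
    }

    /** A collector that asks the scorer to stop after maxTotalHits hits. */
    static class LimitedCollector extends LegacyCollector {
        final int maxTotalHits;   // assumed to be >= 1 in this sketch

        LimitedCollector(int maxTotalHits) { this.maxTotalHits = maxTotalHits; }

        @Override
        boolean processHit(int doc, float score) {
            collect(doc, score);
            return docs.size() < maxTotalHits;
        }
    }

    /** Simulated scorer loop; returns how many docs it visited. */
    static int scoreAll(int numDocs, SketchCollector collector) {
        int visited = 0;
        for (int doc = 0; doc < numDocs; doc++) {
            visited++;
            if (!collector.processHit(doc, 1.0f)) {
                break;   // collector signalled "stop"
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        LimitedCollector limited = new LimitedCollector(3);
        System.out.println(scoreAll(1000, limited) + " " + limited.docs);   // prints 3 [0, 1, 2]
    }
}
```

The cost of this design is that every loop that currently calls collect() has to be changed to call processHit() and check the result, which is the bulk of the work implied by the steps above.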

I think that would do it.  But is it worth it?

In the past I've frequently wanted to be able to extend TopDocs-based 
searching, so I think the Lucene patch I've constructed so far is 
generally useful.

Doug
