nutch-dev mailing list archives

From Doug Cutting <cutt...@nutch.org>
Subject Re: IndexOptimizer (Re: Lucene performance bottlenecks)
Date Mon, 12 Dec 2005 17:23:04 GMT

Andrzej Bialecki wrote:
> For single term queries (in Nutch - in Lucene they are rewritten to 
> complex BooleanQueries), the hit lists are nearly identical for the 
> first 10 hits, then they start to differ more and more as you progress 
> along the original hit list. This is not so surprising - after all, this 
> "optimization" operation is lossy. Still, the differences are higher 
> than it was reported in that paper by Suel (but they used a different 
> algorithm to select the postings) - Suel et al. were able to achieve 98% 
> accuracy for the top-10 results, _including_ multi-term boolean queries.

A better way to prune the index might be to look at the sum of 
query-boosted scores from the content, title, url and anchor fields for 
each term.  One could process four TermEnums in parallel, one for each 
field, and include documents in the index if the sum places them in the 
top 10%.  But this is rather complex, and I am hopeful that a simpler 
method may work better.
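
To make the cross-field idea concrete, here is a minimal sketch in plain 
Python. It is not Lucene code: the per-field postings are modeled as 
dicts of doc id to score instead of TermEnums, and the name prune_term, 
the boost values in FIELD_BOOSTS and the keep_fraction parameter are all 
hypothetical, chosen only for illustration.

```python
import math
from collections import defaultdict

# Hypothetical query boosts per field (illustrative values only;
# real ones would come from the Nutch configuration).
FIELD_BOOSTS = {"content": 1.0, "title": 1.5, "url": 4.0, "anchor": 2.0}


def prune_term(postings_by_field, boosts=FIELD_BOOSTS, keep_fraction=0.10):
    """Pick the documents to keep for a single term.

    postings_by_field maps a field name to {doc_id: score}, where score
    is that document's score for the term in that field.  Returns the
    set of doc ids whose summed, boosted score is in the top
    keep_fraction of all documents containing the term.
    """
    totals = defaultdict(float)
    for field, postings in postings_by_field.items():
        boost = boosts.get(field, 1.0)
        for doc_id, score in postings.items():
            totals[doc_id] += boost * score

    if not totals:
        return set()

    # Always keep at least one document per term.
    keep = max(1, math.ceil(len(totals) * keep_fraction))
    # Highest total first; ties broken by doc id for determinism.
    ranked = sorted(totals, key=lambda d: (-totals[d], d))
    return set(ranked[:keep])
```

A real implementation would have to walk the four term enumerations in 
step and write the surviving postings back out per field, which is where 
most of the complexity mentioned above comes from.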

> For multi-term Nutch queries, which are rewritten to a combination of 
> boolean queries and sloppy phrase queries, the effects are disastrous - 

Yes, this is why I was discouraged and stopped working on this.

However, I am now hopeful that sorting the entire index by page score 
and using the top 1000 might work well with Nutch queries, since page 
score is field-independent, and I think the fields are what cause the 
problems.  Plus, this would be a lot simpler than the cross-field 
summing described above.
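
One possible reading of the sorted-index idea, sketched in plain Python 
under stated assumptions: documents are renumbered so that doc ids run 
in descending page-score order, and a search then stops after the first 
1000 matches. The functions sort_index_by_page_score and first_matches, 
and the toy index layout (term to list of doc ids), are hypothetical and 
are not Lucene or Nutch APIs.

```python
def sort_index_by_page_score(postings, page_scores):
    """Renumber documents so that new doc id 0 has the highest page score.

    postings maps term -> iterable of old doc ids.
    page_scores maps old doc id -> page score.
    Returns (sorted_postings, order), where sorted_postings maps
    term -> ascending list of new doc ids and order[new_id] == old_id.
    """
    order = sorted(page_scores, key=lambda d: (-page_scores[d], d))
    new_id = {old: new for new, old in enumerate(order)}
    sorted_postings = {
        term: sorted(new_id[d] for d in docs)
        for term, docs in postings.items()
    }
    return sorted_postings, order


def first_matches(sorted_postings, terms, limit=1000):
    """Return up to `limit` new doc ids that contain every term.

    Because doc ids follow page-score order, the first `limit` matches
    are also the `limit` matches with the highest page scores, so the
    scan can stop early.
    """
    lists = [sorted_postings.get(t, []) for t in terms]
    if not lists:
        return []
    others = [set(docs) for docs in lists[1:]]
    hits = []
    for doc in lists[0]:
        if all(doc in docs for docs in others):
            hits.append(doc)
            if len(hits) == limit:
                break
    return hits
```

The candidates returned this way would still need to be scored against 
the actual query; the sketch only shows why the cutoff does not depend 
on any particular field.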

I can start writing an index-sorter today, unless you are already 
working on this.  If you have an evaluation framework, that would be great.

Doug
