nutch-dev mailing list archives

From Andrzej Bialecki <...@getopt.org>
Subject IndexOptimizer (Re: Lucene performance bottlenecks)
Date Mon, 12 Dec 2005 16:32:59 GMT
Doug Cutting wrote:

> The IndexOptimizer.java class in the searcher package was an old 
> attempt to create something like what Suel calls "fancy postings".  It 
> creates an index with the top 10% scoring postings.  Since documents 
> are not renumbered one can intermix postings from this with the full 
> index.  So for example, one can first try searching using this index 
> for terms that occur more than, e.g., 10k times, and use the full 
> index for rarer words.  If that does not find 1000 hits then the full 
> index must be searched.  Such an approach can be combined with using a 
> pre-sorted index.
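
The fallback Doug describes can be sketched as follows (all names are hypothetical stand-ins, not real Nutch/Lucene API - searchOptimized/searchFull represent searches against the small and the full index):

```java
import java.util.ArrayList;
import java.util.List;

public class TieredSearch {
    // Terms occurring more often than this are tried against the small index first.
    static final int COMMON_TERM_THRESHOLD = 10_000;
    // If the small index cannot produce this many hits, fall back to the full index.
    static final int REQUIRED_HITS = 1000;

    // Stubs standing in for real searches against the two indexes.
    static List<Integer> searchOptimized(String term) { return new ArrayList<>(); }
    static List<Integer> searchFull(String term) { return new ArrayList<>(); }
    static int docFreq(String term) { return 0; }

    static List<Integer> search(String term) {
        if (docFreq(term) > COMMON_TERM_THRESHOLD) {
            List<Integer> hits = searchOptimized(term);
            if (hits.size() >= REQUIRED_HITS) {
                return hits; // the top-10% index produced enough hits
            }
        }
        return searchFull(term); // rare term, or too few hits: use the full index
    }

    public static void main(String[] args) {
        System.out.println(search("nutch").size()); // prints 0 (stubs return nothing)
    }
}
```

Because documents are not renumbered, the doc ids returned by either tier refer to the same documents, which is what makes the fallback transparent to the caller.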


I tested the IndexOptimizer, comparing the result lists from the 
original and the optimized index.

The trick in the original IndexOptimizer to avoid copying field data 
no longer works - it throws exceptions during segment merging. I 
"fixed" it by commenting out the overridden numDocs() and maxDoc() in 
OptimizingReader.
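
For context, the shape involved is roughly a delegating reader that under-reports its document counts (a hypothetical reduction - the real OptimizingReader wraps Lucene's IndexReader):

```java
// Hypothetical reduction of the OptimizingReader trick: a delegating
// reader whose overridden counts disagree with what the delegate
// actually exposes. A consumer that trusts maxDoc() while reading the
// delegate's documents (as a segment merger would) can then fail.
interface SimpleReader {
    int numDocs();
    int maxDoc();
}

class FullReader implements SimpleReader {
    public int numDocs() { return 100; }
    public int maxDoc() { return 100; }
}

class OptimizingReaderSketch implements SimpleReader {
    static final double FRACTION = 0.1; // keep the top 10% of postings
    private final SimpleReader delegate;

    OptimizingReaderSketch(SimpleReader delegate) { this.delegate = delegate; }

    // The kind of overrides that were commented out: scaled-down counts.
    public int numDocs() { return (int) (delegate.numDocs() * FRACTION); }
    public int maxDoc() { return (int) (delegate.maxDoc() * FRACTION); }

    public static void main(String[] args) {
        SimpleReader r = new OptimizingReaderSketch(new FullReader());
        System.out.println(r.maxDoc()); // prints 10, though the delegate holds 100 docs
    }
}
```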

Then, after analyzing the explanations I came to the conclusion that the 
IDFs are calculated based on the original ratios of docFreq/numDocs, so 
I needed to modify Similarity.idf() to account for the changed 
docFreq/numDocs (by FRACTION).
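
The adjustment can be sketched like this (a rough sketch, not the actual patch - it assumes Lucene's classic formula idf = 1 + ln(numDocs / (docFreq + 1)), and that the optimized index shrinks docFreq by FRACTION):

```java
public class AdjustedIdf {
    static final double FRACTION = 0.1; // the optimized index keeps the top 10% of postings

    // Lucene's classic idf formula.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Approximate the original index's idf from the optimized index's
    // docFreq by scaling it back up by 1/FRACTION.
    static double adjustedIdf(int optimizedDocFreq, int numDocs) {
        int approxOriginalDocFreq = (int) Math.round(optimizedDocFreq / FRACTION);
        return idf(approxOriginalDocFreq, numDocs);
    }

    public static void main(String[] args) {
        // A term with 1,000 postings in the optimized index scores as if
        // it had appeared in roughly 10,000 documents of the original index.
        System.out.println(adjustedIdf(1000, 1_000_000));
    }
}
```

Without this rescaling, every common term in the optimized index looks artificially rare and gets a boosted idf, which skews the scores relative to the full index.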

The results, speed-wise, were very encouraging - however, after 
comparing the hit lists I discovered that they differed significantly.

For single-term queries (single-term in Nutch, that is - Lucene 
rewrites them into complex BooleanQueries), the hit lists are nearly 
identical for the first 10 hits, then they diverge more and more as you 
progress along the original hit list. This is not so surprising - after 
all, this "optimization" is lossy. Still, the differences are larger 
than those reported in the paper by Suel et al. (though they used a 
different algorithm to select the postings) - they were able to achieve 
98% accuracy for the top-10 results, _including_ multi-term boolean queries.
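
The top-10 agreement can be measured with a simple set-overlap check (a hypothetical helper in the spirit of that accuracy metric, not taken from the actual test scripts):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Overlap {
    // Fraction of the reference top-k doc ids that also appear in the
    // candidate top-k (order-insensitive), as in Suel's top-k accuracy.
    static double topKOverlap(List<Integer> reference, List<Integer> candidate, int k) {
        Set<Integer> ref = new HashSet<>(reference.subList(0, Math.min(k, reference.size())));
        Set<Integer> cand = new HashSet<>(candidate.subList(0, Math.min(k, candidate.size())));
        ref.retainAll(cand); // keep only doc ids present in both top-k lists
        return (double) ref.size() / k;
    }

    public static void main(String[] args) {
        List<Integer> original  = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        List<Integer> optimized = Arrays.asList(1, 2, 3, 4, 5, 11, 12, 13, 14, 15);
        System.out.println(topKOverlap(original, optimized, 10)); // prints 0.5
    }
}
```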

For multi-term Nutch queries, which are rewritten to a combination of 
boolean queries and sloppy phrase queries, the effects are disastrous - 
I could barely manage to get some of the matching hits within the first 
300 results, and their order was completely at odds with the original 
hit list. This is probably due to the scoring of sloppy phrases - I need 
to modify the test scripts to compare the explanations from matching 
results...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


