nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: IndexOptimizer (Re: Lucene performance bottlenecks)
Date Wed, 14 Dec 2005 23:16:31 GMT
Doug Cutting wrote:

> Andrzej Bialecki wrote:
>
>> Ok, I just tested IndexSorter for now. It appears to work correctly,
>> at least I get exactly the same results, with the same scores and the
>> same explanations, if I run the same queries on the original and on
>> the sorted index.
>
>
> Here's a more complete version, still mostly untested.  This should
> make searches faster.  We'll see how good the results are...
>
> This includes a patch to Lucene to make it easier to write hit 
> collectors that collect TopDocs.
>
> I'll test this on a 38M document index tomorrow.


I tested it on a 5 million document index.

The original index is considered the "baseline", i.e. it represents 
normative values for scoring and ranking. These results are compared to 
results from the optimized index, and scores and positions of hits are 
also recorded. Finally, these two lists are matched, and relative 
differences in scoring and ranking are calculated.

At the end, I calculate top10%, top50% and top100% scores, defined as the 
percentage of the top-N hits from the optimized index that match the 
top-N hits from the baseline index. Ideally, all these measures should 
be 100%, i.e. every top-N hit from the optimized index should match a 
corresponding top-N hit from the baseline index.
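The topN matching described above can be sketched roughly as follows (the doc-id lists here are made up for illustration, not taken from my actual runs):

```python
def topn_overlap(baseline, optimized, n):
    """Percentage of the top-n hits from the optimized index that also
    appear among the top-n hits from the baseline index (by doc id)."""
    base = set(baseline[:n])
    matched = sum(1 for doc in optimized[:n] if doc in base)
    return 100.0 * matched / n

# hypothetical ranked doc-id lists from the two indexes
baseline  = [3, 7, 1, 9, 4, 2, 8, 5, 6, 0]
optimized = [3, 1, 7, 9, 2, 4, 5, 8, 0, 6]
print(topn_overlap(baseline, optimized, 5))   # 80.0
print(topn_overlap(baseline, optimized, 10))  # 100.0
```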

One variable which greatly affects both the recall and the performance 
is the maximum number of hits considered by the TopDocCollector. In my 
tests I used values from 1,000 up to 500,000 (the latter represents 
1/10th of the full index in my case).
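The effect of the MAX_HITS cap can be illustrated with a toy sketch of early termination over a boost-sorted index (the doc list and scoring function are assumptions for illustration only; the real work is done by Lucene's hit collector):

```python
def search_truncated(matching_docs, score_fn, max_hits, k=10):
    """Score only the first max_hits matching docs -- the index is sorted
    by descending document boost, so the tail is unlikely to hold top
    hits -- then return the top-k (score, doc) pairs."""
    candidates = matching_docs[:max_hits]   # early termination
    scored = sorted(((score_fn(d), d) for d in candidates), reverse=True)
    return scored[:k]

# toy setup: doc ids happen to follow boost order, score decays with id
docs = list(range(1000))
score = lambda d: 1.0 / (1 + d)
top = search_truncated(docs, score, max_hits=100, k=3)
print([d for _, d in top])  # [0, 1, 2]
```

The smaller MAX_HITS is, the fewer documents get scored and the faster the query runs, at the cost of possibly missing high-scoring documents that sit past the cutoff.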

Now, the results. I collected all test results in a spreadsheet 
(OpenDocument or PDF format), you can download it from:

    http://www.getopt.org/nutch/20051214/nutchPerf.ods
    http://www.getopt.org/nutch/20051214/nutchPerf.pdf

For MAX_HITS=1000 the performance increase was ca. 40-fold, i.e. 
queries that previously executed in e.g. 500 ms now executed in 10-20 ms 
(perfRate=40). As intuition suggests, performance drops as we increase 
MAX_HITS, until it reaches more or less the original values (perfRate=1) 
for MAX_HITS=300000 (for a 5 million doc index). After that, increasing 
MAX_HITS actually worsens the performance (perfRate << 1), which can be 
explained by the fact that the standard HitCollector doesn't collect as 
many documents if they score too low.

* Single-term Nutch queries (i.e. those which do not produce Lucene 
PhraseQueries) yield relatively good topN values, even for relatively 
small values of MAX_HITS; however, MAX_HITS=1000 yields topN=0% across 
the board. The minimum useful value for my index was MAX_HITS=10000 
(perfRate=30), which yields a quite acceptable top10=90%, but less 
acceptable top50 and top100. Please see the spreadsheet for details.

* Two-term Nutch queries result in complex Lucene BooleanQueries over 
many index fields, including PhraseQueries. These fared much worse 
than single-term queries: the topN values were very low until 
MAX_HITS was increased to large values, at which point all the 
topN-s suddenly jumped into the 80-90% range.

I also noticed that the values of topN depended strongly on the document 
frequency of terms in the query. For a two-term query, where both terms 
have average document frequency, the topN values start from ~50% for low 
MAX_HITS. For a two-term query where one of the terms has a very high 
document frequency, the topN values start from 0% for low MAX_HITS. See 
the spreadsheet for details.

Conclusions: more work is needed... ;-)

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


