nutch-dev mailing list archives

From: Andrzej Bialecki
Subject: Re: IndexOptimizer (Re: Lucene performance bottlenecks)
Date: Thu, 15 Dec 2005 20:10:36 GMT
Doug Cutting wrote:

> Andrzej Bialecki wrote:
>> Doug Cutting wrote:
>>> The graph just shows that they differ, not how much better or worse 
>>> they are, since the baseline is not perfect.  When the top-10 is 50% 
>>> different, are those 5 different hits markedly worse matches to your 
>>> eye than the five they've displaced, or are they comparable?  That's 
>>> what really matters.
>> Hmm. I'm not sure I agree with this. Your reasoning would be true if 
>> we were changing the ranking formula. But the goal IMHO with these 
>> patches is to return equally complete results, using the same ranking 
>> formula.
> But we should not assume that the ranking formula is perfect.  Imagine 
> a case where the high-order bits of the score are correct and the 
> low-order bits are random.  Then an optimization which changes local 
> orderings does not actually affect result quality.

Yes, that's true, I could accept that. In these tests the score delta 
between hit #1 and hit #100 was something like 20, and the score 
dropped rapidly after the first 10 or 20 results. Now, the problem is 
that many results _within_ this range (i.e. still within the area with 
large score deltas) were missing. This suggests that the differences 
were also in the high-order bits.
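Doug's point about high- vs. low-order bits can be made concrete: if only the low-order bits of the score differ between two indexes, their rankings should agree once the scores are quantized to some coarser precision. A minimal sketch of that check (the scores and the quantum below are hypothetical, not from the actual test runs):

```java
import java.util.*;

public class ScoreBits {
    // Rank document ids by score, quantized so only "high-order" precision counts.
    static List<Integer> ranking(float[] scores, float quantum) {
        Integer[] ids = new Integer[scores.length];
        for (int i = 0; i < ids.length; i++) ids[i] = i;
        Arrays.sort(ids, (a, b) -> {
            int qa = Math.round(scores[a] / quantum);
            int qb = Math.round(scores[b] / quantum);
            if (qa != qb) return Integer.compare(qb, qa); // higher score first
            return Integer.compare(a, b);                 // stable tie-break
        });
        return Arrays.asList(ids);
    }

    public static void main(String[] args) {
        // Hypothetical scores: identical to one decimal place, noise below that.
        float[] baseline  = {20.003f, 19.501f, 12.102f, 12.101f, 3.004f};
        float[] optimized = {20.001f, 19.502f, 12.101f, 12.102f, 3.001f};
        // Exact scores order docs 2 and 3 differently...
        System.out.println(ranking(baseline, 0.0001f));
        System.out.println(ranking(optimized, 0.0001f));
        // ...but quantizing coarsely makes the rankings identical.
        System.out.println(ranking(baseline, 0.1f).equals(ranking(optimized, 0.1f))); // true
    }
}
```

If the observed divergence survives even coarse quantization, as the large score deltas suggest here, then the missing results are not just low-order noise.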

Please re-run the script on your index, using typical queries, and check 
the results. It's possible that I made a mistake somewhere; it would be 
good to confirm at least the trends in the raw results.

>> I specifically avoided using normalized scores, instead using the 
>> absolute scores in TopDocs. And the absolute scores in both cases are 
>> exactly the same, for those results that are present.
>> What is wrong is that some results that should be there (judging by 
>> the ranking) are simply missing. So, it's about the recall, and the 
>> baseline index gives the best estimate.
> Yes, this optimization, by definition, hurts recall.  The only 
> question is does it substantially hurt relevance at, e.g., 10 hits.  
> If the top-10 are identical then the answer is easy: no, it does not.  
> But if they differ, we can only answer this by looking at results.  
> Chances are they're worse, but how much?  Radically?  Slightly?  
> Noticeably?

The paper by Suel et al. which you referred to claims top-100 overlap as 
high as 98% after optimizations. What I observed were values between 0 
and 60%; pushing the overlap above that level caused a heavy performance 
loss.
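The overlap figures being compared here (Suel et al.'s ~98% vs. the observed 0-60%) are just the fraction of the baseline top-k doc ids that survive into the optimized top-k. A minimal sketch (the doc-id lists are hypothetical, chosen to match the "top-10 is 50% different" case mentioned earlier):

```java
import java.util.*;

public class TopKOverlap {
    // Fraction of the baseline top-k doc ids also present in the optimized top-k.
    static double overlap(int[] baseline, int[] optimized, int k) {
        Set<Integer> opt = new HashSet<>();
        for (int i = 0; i < k; i++) opt.add(optimized[i]);
        int hits = 0;
        for (int i = 0; i < k; i++) if (opt.contains(baseline[i])) hits++;
        return (double) hits / k;
    }

    public static void main(String[] args) {
        int[] baseline  = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        int[] optimized = {1, 2, 3, 4, 5, 21, 22, 23, 24, 25};
        System.out.println(overlap(baseline, optimized, 10)); // 0.5, i.e. 50% overlap
    }
}
```

Note this metric ignores ordering within the top-k, which is consistent with Doug's argument that local reorderings may not matter.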

>>> What part of Nutch are you trying to avoid?  Perhaps you could try 
>>> measuring your Lucene-only benchmark against a Nutch-based one.  If 
>>> they don't differ markedly then you can simply use Nutch, which 
>>> makes it a stronger benchmark.  If they differ, then we should 
>>> figure out why.
>> Again, I don't see it this way. Nutch results will always be worse 
>> than pure Lucene, because of the added layers. If I can't improve the 
>> performance in Lucene code (which takes > 85% time for every query) 
>> then no matter how well optimized Nutch code is it won't get far.
> But we're mostly modifying Nutch's use of Lucene, not modifying 
> Lucene.  So measuring Lucene alone won't tell you everything, and 
> you'll keep having to port Nutch stuff.  If you want to, e.g., replay 
> a large query log to measure average performance, then you'll need 
> things like auto-filterization, n-grams, query plugins, etc., no?

Perhaps we misunderstood each other: I'm using an index built by Nutch; 
there's no substitute for that, I agree. It was just more convenient for 
me to skip all Nutch classes for _querying_ alone, because it was easier 
to control the exact final form of the Lucene query - especially when 
you want to experiment quickly with a lot of variables that are not 
(yet) parametrized through the config files. In the end you get a plain 
Lucene query either way, only then you don't know exactly how much time 
was spent on translating, building filters, etc. *shrug* You can do it 
either way, I agree.

>>>> In several installations I use smaller values of slop (around 
>>>> 20-40). But this is motivated by better quality matches, not by 
>>>> performance, so I didn't test for this...
>>> But that's a great reason to test for it!  If lower slop can improve 
>>> result quality, then we should certainly see if it also makes 
>>> optimizations easier.
>> I forgot to mention this - the tests I ran already used the smaller 
>> values: the slop was set to 20.
> Are they different if the slop is Integer.MAX_VALUE?  It would be 
> really good to determine what causes results to diverge, whether it is 
> multiple terms (probably not) phrases (probably) and/or slop 
> (perhaps).  Chances are that the divergence is bad, that results are 
> adversely affected, and that we need to try to fix it.  But to do so 
> we'll need to understand it.

Agreed. I'll try to re-run the tests with queries that set a different 
slop value, or omit the phrases completely (which is quite easy to do 
with my approach: just use a different translated query on the 
cmd-line ;-) ).
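Since the script takes already-translated Lucene queries, the variants Doug asks for can be enumerated as plain query-parser strings: Lucene's syntax appends `~N` to a quoted phrase to set the slop. A sketch (the terms are hypothetical; the exact strings Nutch would produce depend on its query translation):

```java
import java.util.*;

public class QueryVariants {
    // Build Lucene query-parser strings for one phrase at several slop values,
    // plus a bag-of-words variant with the phrase dropped entirely.
    static List<String> variants(String[] terms, int[] slops) {
        String phrase = "\"" + String.join(" ", terms) + "\"";
        List<String> out = new ArrayList<>();
        for (int slop : slops) out.add(phrase + "~" + slop);
        out.add(String.join(" ", terms)); // no phrase at all
        return out;
    }

    public static void main(String[] args) {
        int[] slops = {0, 20, Integer.MAX_VALUE};
        for (String q : variants(new String[]{"apache", "nutch"}, slops))
            System.out.println(q);
        // "apache nutch"~0
        // "apache nutch"~20
        // "apache nutch"~2147483647
        // apache nutch
    }
}
```

Running the same query log through each variant and comparing top-10 overlaps against the baseline would isolate whether the divergence comes from the phrases, the slop, or neither.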

>> That's another advantage of using Lucene directly in this script - 
>> you can provide any query structure on the command-line without 
>> changing the code in Nutch.
> But that just means that we should set the SLOP constant from a 
> configuration property, and permit the setting of configuration 
> properties from the command line, no?

Well, if you want to quickly experiment with radically different query 
translation, then no.

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
Contact: info at sigram dot com
