nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: IndexOptimizer (Re: Lucene performance bottlenecks)
Date Thu, 15 Dec 2005 17:10:24 GMT
Doug Cutting wrote:

> Andrzej Bialecki wrote:
>>>  . How were the queries generated?  From a log or randomly?
>> Queries have been picked up manually, to test the worst performing 
>> cases from a real query log.
> So, for example, the 50% error rate might not be typical, but could be 
> worst-case.

Yes, that's true.

>>>  . When results differed greatly, did they look a lot worse?
>> Yes. E.g. see the differences for MAX_HITS=10000
> The graph just shows that they differ, not how much better or worse 
> they are, since the baseline is not perfect.  When the top-10 is 50% 
> different, are those 5 different hits markedly worse matches to your 
> eye than the five they've displaced, or are they comparable?  That's 
> what really matters.

Hmm. I'm not sure I agree with this. Your reasoning would be true if we 
were changing the ranking formula. But the goal IMHO with these patches 
is to return equally complete results, using the same ranking formula.

I specifically avoided using normalized scores, instead using the 
absolute scores in TopDocs. And the absolute scores in both cases are 
exactly the same, for those results that are present.

What is wrong is that some results that should be there (judging by the 
ranking) are simply missing. So this is about recall, and the baseline 
index gives the best estimate of the complete result set.
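
To illustrate the comparison described above, here is a minimal sketch 
(not the actual benchmark script). It assumes each result list is a 
sequence of (doc_id, score) pairs already sorted by descending absolute 
score; the function names and the sample data are hypothetical.

```python
def recall_at_n(baseline, optimized, n=10):
    """Fraction of the baseline top-n doc ids also present in the optimized top-n."""
    base_ids = [doc_id for doc_id, _ in baseline[:n]]
    opt_ids = {doc_id for doc_id, _ in optimized[:n]}
    if not base_ids:
        return 1.0  # nothing to miss
    return sum(1 for d in base_ids if d in opt_ids) / len(base_ids)


def missing_hits(baseline, optimized, n=10):
    """Baseline top-n doc ids that are absent from the optimized top-n."""
    opt_ids = {doc_id for doc_id, _ in optimized[:n]}
    return [doc_id for doc_id, _ in baseline[:n] if doc_id not in opt_ids]


def score_mismatches(baseline, optimized, tol=1e-6):
    """Docs present in both lists whose absolute scores differ by more than tol."""
    base_scores = dict(baseline)
    return [
        (doc_id, base_scores[doc_id], score)
        for doc_id, score in optimized
        if doc_id in base_scores and abs(base_scores[doc_id] - score) > tol
    ]


# Made-up sample data: docs 2 and 4 are missing from the optimized list.
baseline = [(1, 9.0), (2, 8.5), (3, 7.0), (4, 6.5)]
optimized = [(1, 9.0), (3, 7.0), (5, 6.0), (6, 5.5)]

print(recall_at_n(baseline, optimized, n=4))
print(missing_hits(baseline, optimized, n=4))
print(score_mismatches(baseline, optimized))
```

An empty list from score_mismatches corresponds to the observation that 
the scores agree for the hits present in both lists, so any difference 
between the two top-N lists comes from missing hits rather than from 
re-ranking.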

>> I actually forgot to write that I don't use any of Nutch code. Early 
>> on I decided to eliminate this part in order to get first the raw 
>> performance from Lucene - but still using the Lucene queries 
>> corresponding to translated Nutch queries.
> What part of Nutch are you trying to avoid?  Perhaps you could try 
> measuring your Lucene-only benchmark against a Nutch-based one.  If 
> they don't differ markedly then you can simply use Nutch, which makes 
> it a stronger benchmark.  If they differ, then we should figure out why.

Again, I don't see it this way. Nutch performance will always be worse 
than pure Lucene, because of the added layers. If I can't improve the 
performance in Lucene code (which takes more than 85% of the time for 
every query), then no matter how well optimized the Nutch code is, it 
won't get far.

So, I'm reproducing the same queries that Nutch passes to Lucene, in 
order to simulate the typical load generated by Nutch, but avoiding any 
non-essential code that could skew the results. What's wrong with that? 
I think this approach makes the benchmark easier to understand and 
isolates it from non-essential factors. Once we reach a significant 
improvement at this level, of course we should move to using Nutch code 
to test how well it works on top.

>> In several installations I use smaller values of slop (around 20-40). 
>> But this is motivated by better quality matches, not by performance, 
>> so I didn't test for this...
> But that's a great reason to test for it!  If lower slop can improve 
> result quality, then we should certainly see if it also makes 
> optimizations easier.

I forgot to mention this - the tests I ran already used the smaller 
values: the slop was set to 20.

That's another advantage of using Lucene directly in this script - you 
can provide any query structure on the command-line without changing the 
code in Nutch.
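
As a rough illustration of passing a query structure on the command 
line, the sketch below builds a sloppy phrase in the Lucene query parser 
syntax ("term1 term2"~slop). It is not the script used for these tests: 
the option names (--slop, --field) and the function names are 
hypothetical, and terms are not escaped.

```python
import argparse


def sloppy_phrase(terms, slop, field=None):
    """Return a phrase query string with the given slop, optionally field-qualified."""
    phrase = '"' + " ".join(terms) + '"~' + str(slop)
    return f"{field}:{phrase}" if field else phrase


def parse_args(argv=None):
    """Parse hypothetical command-line options for the benchmark query."""
    parser = argparse.ArgumentParser(description="Build a sloppy phrase query string")
    parser.add_argument("terms", nargs="+", help="query terms, in phrase order")
    parser.add_argument("--slop", type=int, default=20, help="phrase slop")
    parser.add_argument("--field", default=None, help="optional field name")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    print(sloppy_phrase(args.terms, args.slop, args.field))
```

For example, main(["--slop", "20", "information", "retrieval"]) prints 
"information retrieval"~20, which can then be handed to a query parser 
without touching any Nutch code.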

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
Contact: info at sigram dot com
