nutch-dev mailing list archives

From Doug Cutting
Subject Re: IndexOptimizer (Re: Lucene performance bottlenecks)
Date Thu, 15 Dec 2005 16:48:39 GMT
Andrzej Bialecki wrote:
>>  . How were the queries generated?  From a log or randomly?
> Queries were picked manually from a real query log, to test the 
> worst-performing cases.

So, for example, the 50% error rate might not be typical, but could be 

>>  . When results differed greatly, did they look a lot worse?
> Yes. E.g. see the differences for MAX_HITS=10000

The graph just shows that they differ, not how much better or worse they 
are, since the baseline is not perfect.  When the top-10 is 50% 
different, are those 5 different hits markedly worse matches to your eye 
than the five they've displaced, or are they comparable?  That's what 
really matters.
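
One way to make that comparison concrete is to list exactly which hits were displaced and which ones replaced them, so the two groups can be judged side by side. Below is a minimal Python sketch, assuming each run yields a ranked list of document ids. The function name and the ids are hypothetical and not part of Nutch or Lucene.

```python
def compare_top_k(baseline, optimized, k=10):
    """Compare the top-k hits of two ranked lists of document ids.

    Returns (fraction of baseline top-k that was displaced,
             displaced baseline hits, hits that replaced them).
    """
    base_top = baseline[:k]
    opt_top = optimized[:k]
    base_set = set(base_top)
    opt_set = set(opt_top)

    # Hits the baseline ranked in the top k that the optimized run dropped.
    displaced = [doc for doc in base_top if doc not in opt_set]
    # Hits that appear only in the optimized top k.
    newcomers = [doc for doc in opt_top if doc not in base_set]

    fraction = len(displaced) / max(len(base_top), 1)
    return fraction, displaced, newcomers


if __name__ == "__main__":
    # Hypothetical ids: the optimized run keeps 5 hits and swaps in 5 others.
    baseline = ["d%d" % i for i in range(1, 11)]
    optimized = baseline[:5] + ["x%d" % i for i in range(1, 6)]

    fraction, displaced, newcomers = compare_top_k(baseline, optimized)
    print("top-10 difference: %.0f%%" % (fraction * 100))
    print("displaced:", displaced)
    print("newcomers:", newcomers)
```

The fraction only reproduces what the graph already shows. The two lists are the useful part, since they are what someone would have to read and judge by eye.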

> I actually forgot to mention that I don't use any Nutch code. Early on 
> I decided to eliminate that part in order to first measure the raw 
> performance of Lucene, while still using the Lucene queries that the 
> Nutch queries translate to.

What part of Nutch are you trying to avoid?  Perhaps you could try 
measuring your Lucene-only benchmark against a Nutch-based one.  If they 
don't differ markedly then you can simply use Nutch, which makes it a 
stronger benchmark.  If they differ, then we should figure out why.

> In several installations I use smaller values of slop (around 20-40). 
> But this is motivated by better quality matches, not by performance, so 
> I didn't test for this...

But that's a great reason to test for it!  If lower slop can improve 
result quality, then we should certainly see if it also makes 
optimizations easier.

