nutch-dev mailing list archives

From Doug Cutting
Subject Re: IndexOptimizer (Re: Lucene performance bottlenecks)
Date Thu, 15 Dec 2005 05:15:34 GMT
Andrzej Bialecki wrote:
> I tested it on a 5 mln index.

Thanks, this is great data!

Can you please tell us a bit more about the experiments?  In particular:
  . How were scores assigned to pages?  Link analysis?  log(number of 
incoming links) or OPIC?
  . How were the queries generated?  From a log or randomly?
  . How many queries did you test with?
  . When results differed greatly, did they look a lot worse?

My attempt to sort a 38M page index failed with OutOfMemory.  Sigh.

> For MAX_HITS=1000 the performance increase was ca. 40-fold, i.e. 
> queries, which executed in e.g. 500 ms now executed in 10-20ms 
> (perfRate=40). Following the intuition, performance drops as we increase 
> MAX_HITS, until it reaches more or less the original value (perfRate=1) 
> for MAX_HITS=300000 (for a 5 mln doc index). After that, increasing 
> MAX_HITS actually worsens the performance (perfRate << 1) - which can be 
> explained by the fact that the standard HitCollector doesn't collect as 
> many documents, if they score too low.

This doesn't make sense to me.  It should never be slower.  We're not 
actually keeping track of any more hits, only stopping earlier.
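
To illustrate why a hit limit should only ever reduce work, here is a minimal sketch in Python. It is not the IndexOptimizer or Lucene's HitCollector: collect_top, its max_hits parameter and the toy data are hypothetical, and it assumes documents are visited in descending score order, as in a score-sorted index.

```python
import heapq

def collect_top(scored_docs, top_n, max_hits=None):
    """Keep the top_n best hits; optionally stop after max_hits hits.

    scored_docs yields (doc_id, score) pairs. Returns the kept hits as
    (score, doc_id) pairs, best first, and the number of hits examined.
    """
    heap = []      # min-heap holding the best top_n hits seen so far
    examined = 0
    for doc_id, score in scored_docs:
        if max_hits is not None and examined >= max_hits:
            break  # the only difference from the unlimited case
        examined += 1
        if len(heap) < top_n:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True), examined

# Toy index (assumption): 1000 docs already ordered by descending score.
docs = [(i, 1.0 / (i + 1)) for i in range(1000)]
full_hits, full_examined = collect_top(docs, 10)
limited_hits, limited_examined = collect_top(docs, 10, max_hits=100)
```

The per-hit work is identical in both calls; the limited call just leaves the loop sooner, so under these assumptions it cannot examine more hits than the unlimited one.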

> * Two-term Nutch queries result in complex Lucene BooleanQueries over 
> many index fields, including PhraseQueries. These fared much worse 
> than single-term queries: actually, the topN values were very low until 
> MAX_HITS was increased to large values, and then all of a sudden all 
> topN-s flipped into the 80-90% ranges.

It would be interesting to try altering the generated query, to see if 
it is the phrases or simply multiple terms which cause problems.  To do 
this, one could hack the query-basic plugin, or simply alter query boost 
parameters.  This would help us figure out where the optimization is 
failing.  Suel used multi-term queries, but not phrases, so we expect 
that the phrases are causing the problem, but it would be good to see 
for certain.  We've also never tuned Nutch's phrase matching, so it's 
possible that we sometimes over-emphasize the phrase component 
in scores.  For example, a slop of 10 might give better results and/or 
be more amenable to this optimization.
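
As a rough picture of what a larger slop changes, here is a simplified sketch in Python for a two-term phrase. It only handles terms that appear in query order and is not Lucene's actual sloppy-phrase matching (which also accepts reordered terms at a higher slop cost); positions, sloppy_match and the sample text are made up for illustration.

```python
def positions(tokens, term):
    """Token positions at which term occurs."""
    return [i for i, t in enumerate(tokens) if t == term]

def sloppy_match(positions_a, positions_b, slop):
    """True if some occurrence of b follows some occurrence of a
    with at most `slop` other tokens in between (in-order only)."""
    return any(0 <= pb - pa - 1 <= slop
               for pa in positions_a
               for pb in positions_b)

# Hypothetical document text, split on whitespace.
doc = "open source search engine built on a fast text search library".split()

adjacent = sloppy_match(positions(doc, "search"), positions(doc, "engine"), 0)
far_exact = sloppy_match(positions(doc, "open"), positions(doc, "library"), 0)
far_sloppy = sloppy_match(positions(doc, "open"), positions(doc, "library"), 10)
```

Here "search engine" matches as an exact phrase, while "open library" matches only once the slop is large enough to cover the nine tokens between the two terms.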

> I also noticed that the values of topN depended strongly on the document 
> frequency of terms in the query. For a two-term query, where both terms 
> have average document frequency, the topN values start from ~50% for low 
> MAX_HITS. For a two-term query where one of the terms has a very high 
> document frequency, the topN values start from 0% for low MAX_HITS. See 
> the spreadsheet for details.

Were these actually useful queries?  For example, I would not be 
concerned if results differed greatly for a query like 'to be', since 
that's not a very useful query.  Try searching for 'the the' on Google.
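
For comparing result lists, one way to compute a topN figure is sketched below in Python. The thread does not define topN, so this assumes it means the percentage of the top N results of the unrestricted search that also appear in the top N of the MAX_HITS-limited search; top_n_overlap and the document ids are hypothetical.

```python
def top_n_overlap(reference, candidate, n):
    """Percentage of the first n reference results that also appear
    among the first n candidate results (order within the top n is ignored)."""
    ref = reference[:n]
    if not ref:
        return 0.0
    cand = set(candidate[:n])
    return 100.0 * sum(1 for doc in ref if doc in cand) / len(ref)

# Hypothetical ranked doc ids from an unlimited and a limited search.
unlimited = [11, 42, 7, 93, 5]
limited = [42, 11, 60, 61, 7]
overlap_at_2 = top_n_overlap(unlimited, limited, 2)
overlap_at_5 = top_n_overlap(unlimited, limited, 5)
```

A measure like this says nothing about whether the differing results are worse, which is why looking at the actual queries still matters.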


