nutch-dev mailing list archives

From: Doug Cutting <cutt...@nutch.org>
Subject: Re: Lucene performance bottlenecks
Date: Thu, 08 Dec 2005 17:28:03 GMT
Andrzej Bialecki wrote:
> Hmm... Please define what "adequate" means. :-) IMHO, "adequate" is when 
> for any query the response time is well below 1 second. Otherwise the 
> service seems sluggish. Response times over 3 seconds are normally not 
> acceptable.

It depends.  Clearly an average response time of less than 1 second is 
better than an average response time of 3 seconds; there is no argument: 
that is a more useful search engine.  But a search engine with a 
3-second average response time is still much better than no search 
engine at all.  If an institution cannot afford to guarantee a 1-second 
average response time, should it therefore not run a search engine at 
all?  For low-traffic, non-commercial search engines, sluggishness is 
not a fatal fault.

> There is a total of 8,435,793 pages in that index. Here's a short list 
> of queries, the number of matching pages, and the average time (I made 
> just a couple of tests, no stress-loading ;-) )
> 
> * hurricane: 1,273,805 pages, 1.75 seconds.
> * katrina: 1,267,240 pages, 1.76 seconds
> * gov: 979,820 pages, 1.01 seconds

These are some of the slowest terms in this index.

> * hurricane katrina: 773,001 pages, 3.5 seconds (!)

This is not a very interesting query for this collection...

> * "hurricane katrina": 600,867 pages, 1.35 seconds
> * disaster relief: 205,066 pages, 1.12 seconds
> * "disaster relief": 140,007 pages, 0.42 seconds
> * hurricane katrina disaster relief: 129,353 pages, 1.99 seconds
> * "hurricane katrina disaster relief": 2,006 pages, 0.705 seconds
> * xys: 227 pages, 0.01 seconds
> * xyz: 3,497 pages,  0.005 seconds

> What I found out is that "usable" depends a lot on how you test it and 
> what is your minimum expectation. There are some high-frequency terms 
> (and by this I mean terms with frequency around 25%) that will 
> consistently cause a dramatic slowdown. Multi-term queries, because of 
> the way Nutch expands them into sloppy phrases, may take even more time, 
> so even for such a relatively small index (from the POV of the whole 
> Internet!) the response time may drag out to several seconds (try "com").

How often do you search for "com"?

> Response times over several seconds would mean that users 
> would say goodbye and never return... ;-)

So, tell me, where will these users then search for archived web content 
related to hurricane katrina?  There is no other option.  If this were a 
competitive commercial offering, then some sluggishness would indeed be 
unacceptable, and ~10M pages in a single index might be too many on 
today's processors.  But for a unique, non-profit offering, I don't think 
this is unacceptable: not optimal, but workable.  Should the archive 
refuse to make this content searchable until they have faster or more 
machines, or until Nutch is faster?  I don't think so.

> If 10 mln docs is too much for a single server to meet such a 
> performance target, then this explodes the total number of servers 
> required to handle Internet-wide collections of billions of pages...
> 
> So, I think it's time to re-think the query structure and scoring 
> mechanisms, in order to simplify the Lucene queries generated by Nutch - 
> or to do some other tricks...

I think "other tricks" will be more fruitful.  Lucene is pretty well 
optimized, and I don't think qualitative improvements can be had by 
simplifying the queries without substantially reducing their effectiveness.
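
For context, the sketch below (untested, Lucene 1.4-style API) shows 
roughly the kind of query Nutch builds for a two-term query such as 
"hurricane katrina": every term is required, and an optional sloppy 
phrase rewards documents where the terms occur close together.  The 
field name, slop and boost values are invented for illustration, and the 
real query also expands across other fields such as anchor and title.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class NutchStyleQuery {
  // Builds a simplified, single-field approximation of a Nutch query.
  public static Query build() {
    BooleanQuery q = new BooleanQuery();

    // each user term is required in the content field
    q.add(new TermQuery(new Term("content", "hurricane")), true, false);
    q.add(new TermQuery(new Term("content", "katrina")), true, false);

    // an optional sloppy phrase boosts documents where the terms occur
    // near each other; scoring this for every matching document is a
    // large part of what makes multi-term queries expensive
    PhraseQuery phrase = new PhraseQuery();
    phrase.add(new Term("content", "hurricane"));
    phrase.add(new Term("content", "katrina"));
    phrase.setSlop(10);      // hypothetical slop
    phrase.setBoost(2.0f);   // hypothetical boost
    q.add(phrase, false, false);

    return q;
  }
}

Dropping the phrase clauses would make queries cheaper, but they are 
also much of what makes the ranking good.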

The trick that I think would be most fruitful is something like what 
Torsten Suel describes in his paper titled "Optimized Query Execution in 
Large Search Engines with Global Page Ordering".

http://cis.poly.edu/suel/papers/order.pdf
http://cis.poly.edu/suel/talks/order-vldb.ppt

I believe all of the major search engines implement something like this, 
where heuristics are used to avoid searching the complete index.  (We 
certainly did so at Excite.)  The results are no longer guaranteed to 
always be the absolute highest-scoring, but in most cases are nearly 
identical.

Implementing something like this for Lucene would not be too difficult. 
The index would need to be re-sorted by document boost: documents would 
be re-numbered so that highly-boosted documents get the lowest document 
numbers.  Then a HitCollector can simply stop searching once a given 
number of hits has been found.
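
As a rough, untested sketch against the current HitCollector API (the 
EnoughHits exception and the hit limit are made up for illustration, and 
the index is assumed to already be renumbered in boost order), it might 
look like this:

import org.apache.lucene.search.HitCollector;

public class EarlyStopCollector extends HitCollector {

  // Thrown to abort the search once enough hits have been collected;
  // collect() has no other way to tell the searcher to stop.
  public static class EnoughHits extends RuntimeException {}

  private final int maxHits;     // stop after this many hits
  private final int[] docs;      // collected doc ids, in index (boost) order
  private final float[] scores;  // corresponding scores
  private int count = 0;

  public EarlyStopCollector(int maxHits) {
    this.maxHits = maxHits;
    this.docs = new int[maxHits];
    this.scores = new float[maxHits];
  }

  public void collect(int doc, float score) {
    if (count >= maxHits) {
      throw new EnoughHits();   // enough hits already; abort the search
    }
    docs[count] = doc;
    scores[count] = score;
    count++;
  }

  public int getCount() { return count; }
  public int[] getDocs() { return docs; }
  public float[] getScores() { return scores; }
}

The caller would treat EnoughHits as the normal early exit (assuming an 
IndexSearcher "searcher" over the boost-sorted index and a Query "query"):

EarlyStopCollector collector = new EarlyStopCollector(1000);
try {
  searcher.search(query, collector);
} catch (EarlyStopCollector.EnoughHits e) {
  // expected once 1000 documents have been seen
}

Since hits arrive in roughly document-number order, and document numbers 
now correspond to boost order, the truncated set is drawn from the most 
highly-boosted documents; the collected hits would then be re-sorted by 
score as usual.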

Doug


