nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: Lucene performance bottlenecks
Date Thu, 08 Dec 2005 09:04:32 GMT
(Moving the discussion to nutch-dev, please drop the cc: when responding)

Doug Cutting wrote:

> Andrzej Bialecki wrote:
>> It's nice to have these couple percent... however, it doesn't solve 
>> the main problem; I need 50 or more percent increase... :-) and I 
>> suspect this can be achieved only by some radical changes in the way 
>> Nutch uses Lucene. It seems the default query structure is too 
>> complex to get a decent performance.
> That would certainly help.
> For what it's worth, the Internet Archive has ~10M page Nutch indexes 
> that perform adequately.  See:

Hmm... Please define what "adequate" means. :-) IMHO, "adequate" is when 
for any query the response time is well below 1 second. Otherwise the 
service seems sluggish. Response times over 3 seconds are normally not 
acceptable. This is just for a single query at a time - the number of 
concurrent queries will be a function of the number of concurrent users 
and the search response time, until it reaches the limit of the number 
of threads on the search servers. At that point, the time it takes to 
return the results should give us an estimate of the maximum number of 
queries per second.
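
A minimal sketch of that estimate, assuming each query occupies one search-server thread for its whole response time. The thread count of 10 and the sample response times are illustrative values only, not measurements:

```python
def max_queries_per_second(threads: int, response_time_s: float) -> float:
    """Upper bound on throughput once every thread is busy with one query."""
    if threads <= 0 or response_time_s <= 0:
        raise ValueError("threads and response_time_s must be positive")
    return threads / response_time_s


if __name__ == "__main__":
    assumed_threads = 10  # hypothetical thread pool size
    for seconds in (0.5, 1.0, 3.5):
        qps = max_queries_per_second(assumed_threads, seconds)
        print(f"{seconds:.1f} s per query -> at most {qps:.1f} queries/s")
```

With a fixed thread pool, the bound falls in direct proportion to the response time, which is why slow queries hurt capacity as well as perceived latency.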

There is a total of 8,435,793 pages in that index. Here's a short list 
of queries, the number of matching pages, and the average time (I made 
just a couple of tests, no stress-loading ;-) )

* hurricane: 1,273,805 pages, 1.75 seconds
* katrina: 1,267,240 pages, 1.76 seconds
* gov: 979,820 pages, 1.01 seconds
* hurricane katrina: 773,001 pages, 3.5 seconds (!)
* "hurricane katrina": 600,867 pages, 1.35 seconds
* disaster relief: 205,066 pages, 1.12 seconds
* "disaster relief": 140,007 pages, 0.42 seconds
* hurricane katrina disaster relief: 129,353 pages, 1.99 seconds
* "hurricane katrina disaster relief": 2,006 pages, 0.705 seconds
* xys: 227 pages, 0.01 seconds
* xyz: 3,497 pages, 0.005 seconds
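
To put these numbers in proportion, the sketch below computes what fraction of the 8,435,793-page index a query matches and a crude "seconds per million hits" figure. The data is copied from the list above; the per-hit ratio is only a rough comparison aid, since query time is not assumed to depend on hit count alone:

```python
INDEX_SIZE = 8_435_793  # total pages in the index, from the message

# (query, matching pages, average seconds), copied from the list above
RESULTS = [
    ("hurricane", 1_273_805, 1.75),
    ("katrina", 1_267_240, 1.76),
    ("gov", 979_820, 1.01),
    ("hurricane katrina", 773_001, 3.5),
    ('"hurricane katrina"', 600_867, 1.35),
]


def match_fraction(pages: int, index_size: int = INDEX_SIZE) -> float:
    """Fraction of the index matched by a query."""
    return pages / index_size


def seconds_per_million_hits(pages: int, seconds: float) -> float:
    """Crude cost figure: response time normalised by hit count."""
    return seconds / (pages / 1_000_000)


if __name__ == "__main__":
    for query, pages, seconds in RESULTS:
        print(f"{query:22s} {match_fraction(pages):6.1%} of index, "
              f"{seconds_per_million_hits(pages, seconds):.2f} s per million hits")
```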

> The performance is about what you report, but it is quite usable. 
> (Please don't stress-test this server!)  We recently built a ~100M 
> page Nutch index at the Internet Archive that is surprisingly usable 
> on a single CPU.  (This is not yet publicly accessible.)

What I found out is that "usable" depends a lot on how you test it and 
what your minimum expectation is. There are some high-frequency terms 
(and by this I mean terms with frequency around 25%) that will 
consistently cause a dramatic slowdown. Multi-term queries, because of 
the way Nutch expands them into sloppy phrases, may take even more time, 
so even for such a relatively small index (from the POV of the whole 
Internet!) the response time may drag into several seconds (try "com").
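
The following sketch illustrates why such an expansion gets expensive. It is not Nutch's actual code: the field list and the slop value are assumptions chosen for illustration, and the output is just a Lucene-style query string. Each user term becomes a required clause searched across every field, and a multi-term query additionally gets one sloppy phrase clause per field:

```python
FIELDS = ("url", "anchor", "content", "title", "host")  # assumed field list
SLOP = 10  # assumed phrase slop, for illustration only


def expand_query(terms, fields=FIELDS, slop=SLOP):
    """Return Lucene-style clauses for a hypothetical multi-field expansion."""
    clauses = []
    for term in terms:
        # Every term is required, and is looked up in every field.
        per_field = " ".join(f"{field}:{term}" for field in fields)
        clauses.append(f"+({per_field})")
    if len(terms) > 1:
        # Multi-term queries also get an optional sloppy phrase per field.
        phrase = " ".join(terms)
        for field in fields:
            clauses.append(f'{field}:"{phrase}"~{slop}')
    return clauses


def primitive_clause_count(terms, fields=FIELDS):
    """Number of term and phrase lookups the expanded query performs."""
    count = len(terms) * len(fields)
    if len(terms) > 1:
        count += len(fields)
    return count


if __name__ == "__main__":
    for clause in expand_query(["hurricane", "katrina"]):
        print(clause)
    print(primitive_clause_count(["hurricane", "katrina"]), "primitive clauses")
```

Under these assumptions a two-word query already turns into 15 primitive lookups, five of which are positional phrase matches.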

> Perhaps your traffic will be much higher than the Internet Archive's, 
> or you have contractual obligations that specify certain average query 
> performance, but, if not, ~10M pages is quite searchable using Nutch 
> on a single CPU.

I'm not concerned about the traffic - I believe the distributed search 
can handle a lot of traffic if need be. What I'm concerned about is the 
maximum response time from individual search servers. This is because 
the front-end response time is determined by the longest response time 
from any of the (active) search servers. Response times over 1 second from 
a 10 million page collection are IMHO not adequate, because the service will 
appear slow. Response times over several seconds would mean that users 
would say goodbye and never return... ;-)
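
A small sketch of that point, assuming the front-end fans a query out to all active search servers in parallel and waits for every answer before merging. The per-server times are made-up values:

```python
def front_end_response_time(server_times_s):
    """Front-end latency when it must wait for the slowest search server."""
    if not server_times_s:
        raise ValueError("at least one active search server is required")
    return max(server_times_s)


if __name__ == "__main__":
    # Hypothetical per-server response times for one query, in seconds.
    times = [0.2, 0.3, 0.25, 3.5]
    print(front_end_response_time(times))
```

One slow server is enough to make the whole query slow, regardless of how fast the others respond.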

If 10 million docs is too much for a single server to meet such a 
performance target, then this explodes the total number of servers 
required to handle Internet-wide collections of billions of pages...
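
As a back-of-the-envelope sketch of that scaling: the collection size of one billion pages and the smaller per-server capacities below are assumed figures; only the 10 million docs per server comes from this discussion.

```python
def servers_needed(total_pages: int, pages_per_server: int) -> int:
    """Smallest number of search servers that can hold the collection."""
    if total_pages < 0 or pages_per_server <= 0:
        raise ValueError("total_pages must be >= 0 and pages_per_server > 0")
    return -(-total_pages // pages_per_server)  # ceiling division


if __name__ == "__main__":
    total = 1_000_000_000  # assumed collection size
    for capacity in (10_000_000, 5_000_000, 2_000_000):
        print(f"{capacity:>10,} pages/server -> {servers_needed(total, capacity)} servers")
```

Each halving of the per-server index size needed to meet the latency target doubles the hardware.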

So, I think it's time to re-think the query structure and scoring 
mechanisms, in order to simplify the Lucene queries generated by Nutch - 
or to do some other tricks...

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
