nutch-dev mailing list archives

From Andrzej Bialecki <...@getopt.org>
Subject Re: Lucene performance bottlenecks
Date Thu, 08 Dec 2005 17:49:45 GMT
Doug Cutting wrote:

> Andrzej Bialecki wrote:
>
>> Hmm... Please define what "adequate" means. :-) IMHO, "adequate" is 
>> when for any query the response time is well below 1 second. 
>> Otherwise the service seems sluggish. Response times over 3 seconds 
>> are normally not acceptable.
>
>
> It depends.  Clearly an average response time of less than 1 second is 
> better than an average response time of 3 seconds.  There is no 
> argument.  That is a more useful search engine.  But a search engine 
> with a 3-second average response time is still much better than no 
> search engine at all.  If an institution cannot afford to guarantee 1 
> second average response time, must it not run a search engine?  For 
> low-traffic, non-commercial search engines, sluggishness is not a 
> fatal fault.

[...]

Yes, I fully agree with your arguments here - please accept my apologies 
if I came across as whining or complaining about that particular 
installation - quite the contrary, I think it's a unique and useful service.

My point was about how to improve Nutch response times for large 
collections and for commercial settings, where the service you offer has 
to meet demanding maximum response time requirements... the response 
times from this particular installation just served as an example of the issue.

>> There is a total of 8,435,793 pages in that index. Here's a short 
>> list of queries, the number of matching pages, and the average time 
>> (I made just a couple of tests, no stress-loading ;-) )
>>
>> * hurricane: 1,273,805 pages, 1.75 seconds.
>> * katrina: 1,267,240 pages, 1.76 seconds
>> * gov: 979,820 pages, 1.01 seconds
>
>
> These are some of the slowest terms in this index.
>
>> * hurricane katrina: 773,001 pages, 3.5 seconds (!)
>
>
> This is not a very interesting query for this collection...


That's not the point - the point is that this is a valid query that 
users may enter, and the search engine should be prepared to return 
results within certain acceptable time limits.


>>  even more time, so even for such a relatively small index (from the 
>> POV of the whole Internet!) the response time may drag into several 
>> seconds (try "com").
>
>
> How often do you search for "com"?
>

Ugh... Again, that's beside the point. It's a valid query, and a simple 
one at that, and the response time was awful.

>> Response times over several seconds would mean that users would say 
>> goodbye and never return... ;-)
>
>
> So, tell me, where will these users then search for archived web 
> content related to hurricane katrina?  There is no other option.  If 
> this were a competitive commercial offering, then some sluggishness 
> would indeed be unacceptable, and ~10M pages in a single index might 
> be too many on today's processors.  But in a non-profit unique 
> offering, I don't think this is unacceptable.  Not optimal, but 
> workable.  Should the archive refuse to make this content searchable 
> until they have faster or more machines, or until Nutch is faster?  I 
> don't think so.
>

I hope I made it clear that I wasn't complaining about this particular 
installation. I just used it to illustrate a problem that I also see in 
other installations, where the demands are much higher and much more 
difficult to meet.


>> If 10 mln docs is too much for a single server to meet such a 
>> performance target, then this explodes the total number of servers 
>> required to handle Internet-wide collections of billions of pages...
>>
>> So, I think it's time to re-think the query structure and scoring 
>> mechanisms, in order to simplify the Lucene queries generated by 
>> Nutch - or to do some other tricks...
>
>
> I think "other tricks" will be more fruitful.  Lucene is pretty well 
> optimized, and I don't think qualitative improvements can be had by 
> simplifying the queries without substantially reducing their 
> effectiveness.
>
> The trick that I think would be most fruitful is something like what 
> Torsten Suel describes in his paper titled "Optimized Query Execution 
> in Large Search Engines with Global Page Ordering".
>
> http://cis.poly.edu/suel/papers/order.pdf
> http://cis.poly.edu/suel/talks/order-vldb.ppt
>
> I believe all of the major search engines implement something like 
> this, where heuristics are used to avoid searching the complete 
> index.  (We certainly did so at Excite.)  The results are no longer 
> guaranteed to always be the absolute highest-scoring, but in most 
> cases are nearly identical.
>
> Implementing something like this for Lucene would not be too 
> difficult.  The index would need to be re-sorted by document boost: 
> documents would be re-numbered so that highly-boosted documents had 
> low document numbers.  Then a HitCollector can simply stop searching 
> once a given number of hits are found.


Now we are talking... ;-) This sounds relatively simple and worth 
trying. Thanks for the pointers!
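
Just so I'm sure I understand the idea, below is a rough, untested sketch 
of how such a collector might look against the current HitCollector API. 
The class and field names are mine, it assumes the index has already been 
re-sorted so that low document numbers correspond to highly-boosted 
documents, and aborting the scan by throwing an unchecked exception is 
just one way to stop early:

  import java.io.IOException;

  import org.apache.lucene.search.HitCollector;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;

  /**
   * Sketch of an early-terminating collector.  Assumes documents have been
   * re-numbered so that low doc ids correspond to highly-boosted pages.
   */
  public class EarlyTerminatingCollector extends HitCollector {

    /** Unchecked exception used to abort the search once enough hits are in. */
    public static class StopCollecting extends RuntimeException {}

    private final int maxHits;     // stop after this many matching documents
    private final int[] docs;
    private final float[] scores;
    private int count = 0;

    public EarlyTerminatingCollector(int maxHits) {
      this.maxHits = maxHits;
      this.docs = new int[maxHits];
      this.scores = new float[maxHits];
    }

    public void collect(int doc, float score) {
      // Matches arrive in increasing doc-id order, which (by assumption)
      // is decreasing boost order, so the first maxHits matches are the
      // best candidates under the global page ordering.
      docs[count] = doc;
      scores[count] = score;
      if (++count >= maxHits) {
        throw new StopCollecting();   // abort the rest of the index scan
      }
    }

    public int getHitCount()   { return count; }
    public int[] getDocs()     { return docs; }
    public float[] getScores() { return scores; }

    /** Example driver: scan only until maxHits matches have been seen. */
    public static EarlyTerminatingCollector search(IndexSearcher searcher,
        Query query, int maxHits) throws IOException {
      EarlyTerminatingCollector collector = new EarlyTerminatingCollector(maxHits);
      try {
        searcher.search(query, collector);
      } catch (StopCollecting e) {
        // expected once the cutoff is reached
      }
      return collector;
    }
  }

The collected candidates would of course still have to be sorted by score 
before being returned to the user; the point is only that the scan stops 
after maxHits matches instead of walking the complete posting lists.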

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


