nutch-dev mailing list archives

From Dawid Weiss <dawid.we...@cs.put.poznan.pl>
Subject Re: Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)
Date Mon, 12 Dec 2005 09:22:35 GMT

Hi Andrzej,

This was a very interesting experiment -- thanks for sharing the results 
with us.

> The last range was the maximum in this case - Google wouldn't display 
> any hit above 652 (which I find curious, too - because the total number 
> of hits is, well, significantly higher - and Google claims to return up 
> to the first 1000 results).

I believe this may have something to do with the way Google compacts 
URLs. My guess is that initially 1000 results are found and ranked. 
Then pruning is performed on that set, leaving just a subset of results 
for the user to select from.

If you try this self-indulgent query (with filtering enabled):

http://www.google.com/search?as_q=dawid+weiss&num=10&hl=en&as_qdr=all&as_occt=any&as_dt=i&safe=active&start=900

You get: "Results 781 - 782 of about 61,700"

Now try disabling filtering:

http://www.google.com/search?as_q=dawid+weiss&num=10&hl=en&as_qdr=all&as_occt=any&as_dt=i&safe=images&start=900

Then you get: "Results 781 - 782 of about 65,500"

Hmmm... still the same number of available results, but the total 
estimate is higher.
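
To double-check that the two URLs above really differ only in the 
filtering switch, here is a minimal Python sketch that compares their 
query strings. The helper names (query_params, diff_params) are my own 
and purely illustrative; only the standard library is used, and nothing 
is fetched from Google.

```python
from urllib.parse import urlsplit, parse_qs

# The two advanced-search URLs quoted above.
FILTERED = ("http://www.google.com/search?as_q=dawid+weiss&num=10&hl=en"
            "&as_qdr=all&as_occt=any&as_dt=i&safe=active&start=900")
UNFILTERED = ("http://www.google.com/search?as_q=dawid+weiss&num=10&hl=en"
              "&as_qdr=all&as_occt=any&as_dt=i&safe=images&start=900")


def query_params(url):
    """Return the query string of `url` as a flat {name: value} dict."""
    parsed = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; each key occurs once in these URLs.
    return {name: values[0] for name, values in parsed.items()}


def diff_params(url_a, url_b):
    """Return {name: (value_a, value_b)} for parameters that differ."""
    a, b = query_params(url_a), query_params(url_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    print(diff_params(FILTERED, UNFILTERED))
    # prints {'safe': ('active', 'images')}
```

So the only difference between the two requests is safe=active versus 
safe=images; everything else, including start=900, is identical.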

So far I have used URL parameters found on the "advanced" search page. 
I then tried to "display the omitted search results", as Google 
suggested. Interestingly, this led to:

http://www.google.com/search?q=dawid+weiss&hl=en&shb=t&filter=0&start=900

"Results 541 - 549 of about 65,400"

And that's the maximum you can get.
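
To put the three observations side by side, here is a small Python 
sketch that parses the "Results X - Y of about Z" lines quoted above. 
The regular expression and the function name parse_result_line are my 
own assumptions about the format, based only on these three samples; 
the actual page may use other wording.

```python
import re

# Pattern inferred from the three result lines quoted in this message.
RESULT_LINE = re.compile(
    r"Results\s+([\d,]+)\s*-\s*([\d,]+)\s+of about\s+([\d,]+)"
)


def parse_result_line(text):
    """Return (first, last, estimated_total) parsed from a result line."""
    match = RESULT_LINE.search(text)
    if match is None:
        raise ValueError("unrecognized result line: %r" % text)
    # Strip thousands separators before converting to int.
    first, last, total = (int(g.replace(",", "")) for g in match.groups())
    return first, last, total


# Labels name the URL parameter that distinguishes each request.
OBSERVATIONS = [
    ("safe=active", "Results 781 - 782 of about 61,700"),
    ("safe=images", "Results 781 - 782 of about 65,500"),
    ("filter=0", "Results 541 - 549 of about 65,400"),
]

if __name__ == "__main__":
    for label, line in OBSERVATIONS:
        first, last, total = parse_result_line(line)
        print("%-12s last reachable hit: %4d  estimated total: %6d"
              % (label, last, total))
```

In all three cases the last reachable hit is far below both the 
estimated total and the claimed limit of 1000.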

Sorry, my initial intuition proved wrong -- there is no clear logic 
behind the maximum limit of results you can see (unless you can find 
some logic in the fact that I can see _more_ results when I _exclude_ 
repeated ones from the total).

Dawid






