From: Andrzej Bialecki
Date: Mon, 12 Dec 2005 10:58:41 +0100
To: nutch-dev@lucene.apache.org
Reply-To: nutch-dev@lucene.apache.org
Subject: Re: Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)
Message-ID: <439D49D1.3020406@getopt.org>
References: <3287125F0F0CCE4E8F55435208100CF534EE81@exchange-mbx.be.bvd> <439742A9.9010000@getopt.org> <43974BDA.9040203@apache.org> <4397F720.9070007@getopt.org>
 <43986D23.6020107@nutch.org> <4398747D.8040701@nutch.org> <43995198.3030003@getopt.org> <439D415B.3080506@cs.put.poznan.pl>
In-Reply-To: <439D415B.3080506@cs.put.poznan.pl>

Dawid Weiss wrote:

> Hi Andrzej,
>
> This was a very interesting experiment -- thanks for sharing the
> results with us.
>
>> The last range was the maximum in this case - Google wouldn't display
>> any hit above 652 (which I find curious, too - because the total
>> number of hits is, well, significantly higher - and Google claims to
>> return up to the first 1000 results).
>
> I believe this may have something to do with the way Google compacts
> URLs. My guess is that initially a 1000 results is found and ranked.
> Then pruning is performed on that, leaving just a subset of results
> for the user to select from.

That was my guess, too ...

> Sorry, my initial intuition proved wrong -- there is no clear logic
> behind the maximum limit of results you can see (unless you can find
> some logic in the fact that I can see _more_ results when I _exclude_
> repeated ones from the total).

Well, trying not to sound too much like Spock... Fascinating :-), but
the only logical conclusion is that at the user end we never deal with
any hard results calculated directly from the hypothetical "main index";
we deal only with rough estimates from the "estimated indexes". These
change over time, and perhaps even with the group of servers that
answered this particular query...

My guess is that there could be different "estimated" indexes prepared
for different values of the main boolean parameters, like filter=0...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com