nutch-dev mailing list archives

From "Doğacan Güney" <doga...@gmail.com>
Subject Re: [Fwd: Nutch 0.9 and Crawl-Delay]
Date Tue, 05 Jun 2007 05:59:52 GMT
Hi,

On 6/4/07, Doug Cutting <cutting@apache.org> wrote:
> Does the 0.9 crawl-delay implementation actually permit multiple threads
> to access a site simultaneously?

AFAIK, yes. The fetcher.threads.per.host option should be set greater
than 1 _only_ when you are crawling a site under your own control,
because nutch's politeness policies are pretty much ignored once
fetcher.threads.per.host is greater than 1.

If maxThreads > 1, Fetcher2 completely ignores both nutch's
server-delay and the site's crawl-delay value, and instead uses a
separate min.crawl.delay value between requests to the site.
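
To make that rule concrete, here is a minimal sketch of the delay
selection as described above. It is not taken from the Fetcher2 source:
the class, method, and parameter names are hypothetical, and the
single-thread branch (prefer the site's crawl-delay, otherwise fall back
to the server delay) is my assumption.

```java
// Hypothetical sketch; names are illustrative, not Nutch's actual code.
final class DelaySketch {

    /**
     * Picks the wait (in milliseconds) between successive requests to one host.
     * A negative robotsCrawlDelayMs means robots.txt gave no crawl-delay.
     */
    static long effectiveDelayMs(int maxThreads, long serverDelayMs,
                                 long robotsCrawlDelayMs, long minCrawlDelayMs) {
        if (maxThreads > 1) {
            // Several threads per host: server delay and crawl-delay are both ignored.
            return minCrawlDelayMs;
        }
        // Assumption: with one thread per host, the site's crawl-delay wins if present.
        return robotsCrawlDelayMs >= 0 ? robotsCrawlDelayMs : serverDelayMs;
    }

    public static void main(String[] args) {
        // One thread per host, crawl-delay of 10 s in robots.txt.
        System.out.println(effectiveDelayMs(1, 5000L, 10000L, 0L)); // prints 10000
        // Five threads per host: only the minimum delay applies.
        System.out.println(effectiveDelayMs(5, 5000L, 10000L, 0L)); // prints 0
    }
}
```

With five threads per host and a minimum delay of 0, nothing in this
sketch spaces out the requests, which would match the bursts reported
below.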

I am not sure about Fetcher, but I think it allows up to maxThreads
fetcher threads to access the site simultaneously and then blocks
the next one.
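
As an illustration only (I have not checked how Fetcher really does it),
"allow maxThreads, then block the next one" could be sketched as a
per-host counter of free slots. HostLimiterSketch, tryEnter, and leave
are hypothetical names, and the sketch reports a refusal instead of
blocking so that it is easy to try out.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical sketch; not Nutch's actual implementation.
final class HostLimiterSketch {
    private final int maxThreads;
    private final ConcurrentHashMap<String, Semaphore> slots = new ConcurrentHashMap<>();

    HostLimiterSketch(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    /** Returns true if a fetch slot for this host was free and is now taken. */
    boolean tryEnter(String host) {
        // Each host gets its own semaphore with maxThreads permits.
        return slots.computeIfAbsent(host, h -> new Semaphore(maxThreads)).tryAcquire();
    }

    /** Frees a slot; must be called exactly once per successful tryEnter. */
    void leave(String host) {
        Semaphore s = slots.get(host);
        if (s != null) {
            s.release();
        }
    }

    public static void main(String[] args) {
        HostLimiterSketch limiter = new HostLimiterSketch(2);
        System.out.println(limiter.tryEnter("www.example.org")); // true
        System.out.println(limiter.tryEnter("www.example.org")); // true
        System.out.println(limiter.tryEnter("www.example.org")); // false: third thread must wait
        limiter.leave("www.example.org");
        System.out.println(limiter.tryEnter("www.example.org")); // true again
    }
}
```

A real fetcher would block or requeue the refused thread rather than
drop the request.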

There may be a better explanation in this post to nutch-dev:
"Fetcher2's delay between successive requests".


>
> Doug
>
> -------- Original Message --------
> Subject: Nutch 0.9 and Crawl-Delay
> Date: Sun, 3 Jun 2007 10:50:24 +0200
> From: Lutz Zetzsche <Lutz.Zetzsche@sea-rescue.de>
> Reply-To: nutch-agent@lucene.apache.org
> To: agent@nutch.org
>
> Dear Nutch developers,
>
> I have had problems with a Nutch based robot during the last 12 hours,
> which I have now solved by banning this particular bot from my server
> (not Nutch completely for the moment). The ilial bot, which created
> considerable load on my server, was using the latest Nutch version -
> v0.9 - which is now also supporting the crawl-delay directive in the
> robots.txt.
>
> The bot seems to have obeyed the directive - crawl-delay: 10 - as it
> visited my website every 15 seconds, which would have been ok, BUT it
> then submitted FIVE requests at once (see example log extract below)! 5
> requests at once every 15 seconds is not acceptable on my server, which
> is principally serving dynamic content and is often visited by up to 10
> search engines at the same time, altogether surely creating 99.9% of
> the server traffic.
>
> So my suggestion is that Nutch submit only one request at a time when
> it detects a crawl-delay directive in the robots.txt. This is the
> behaviour the MSNbot shows, for example. The MSNbot also liked to
> submit several requests at once every few seconds, until I added the
> crawl-delay directive to my robots.txt.
>
>
> Best wishes
>
> Lutz Zetzsche
> http://www.sea-rescue.de/
>
>
>
> 72.44.58.191 - - [03/Jun/2007:04:40:53
> +0200] "GET /english/Photos+%26+Videos/PV/ HTTP/1.0" 200
> 13661 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet
> startup company. For more information please visit
> http://www.ilial.com/crawler; http://www.ilial.com/crawler;
> crawl@ilial.com)"
> 72.44.58.191 - - [03/Jun/2007:04:40:53
> +0200] "GET /english/Links/WRGL/Countries/ HTTP/1.0" 200
> 15048 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet
> startup company. For more information please visit
> http://www.ilial.com/crawler; http://www.ilial.com/crawler;
> crawl@ilial.com)"
> 72.44.58.191 - - [03/Jun/2007:04:40:53
> +0200] "GET /islenska/Hlekkir/Brede-ger%C3%B0%20%2F%2033%20fet/
> HTTP/1.0" 200 60041 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles
> based Internet startup company. For more information please visit
> http://www.ilial.com/crawler; http://www.ilial.com/crawler;
> crawl@ilial.com)"
> 66.249.72.244 - - [03/Jun/2007:04:40:55
> +0200] "GET /francais/Liens/Philip+Vaux/Brede%20%2F%2033%20pieds/
> HTTP/1.1" 200 17568 "-" "Mozilla/5.0 (compatible; Googlebot/2.1;
> +http://www.google.com/bot.html)"
> 66.231.189.119 - - [03/Jun/2007:04:40:55
> +0200] "GET
> /english/Links/Martijn%20Koenraad%20Hof/Netherlands%20Antilles/Sint%20Maarten/
>
> HTTP/1.0" 200 17193 "-" "Gigabot/2.0
> (http://www.gigablast.com/spider.html)"
> 74.6.86.105 - - [03/Jun/2007:04:40:56
> +0200] "GET /dansk/Links/Hermann+Apelt/ HTTP/1.0" 200
> 30496 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp;
> http://help.yahoo.com/help/us/ysearch/slurp)"
> 72.44.58.191 - - [03/Jun/2007:04:40:53
> +0200] "GET /italiano/Links/Giamaica/MRCCs+%26+Stazioni+radio+costiera/
> HTTP/1.0" 200 16658 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles
> based Internet startup company. For more information please visit
> http://www.ilial.com/crawler; http://www.ilial.com/crawler;
> crawl@ilial.com)"
> 72.44.58.191 - - [03/Jun/2007:04:40:53
> +0200] "GET /english/Links/Mauritius/Countries/Organisations/ HTTP/1.0"
> 200 15624 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based
> Internet startup company. For more information please visit
> http://www.ilial.com/crawler; http://www.ilial.com/crawler;
> crawl@ilial.com)"
>


-- 
Doğacan Güney