nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-293) support for Crawl-delay in Robots.txt
Date Wed, 07 Jun 2006 20:28:30 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415202 ] 

Andrzej Bialecki  commented on NUTCH-293:
-----------------------------------------

Stefan, as you remember, we discussed modifying the fetcher, and specifically replacing
the thread spin-waiting mechanism with a worker queue. As it stands, this is a can of worms
that I'd rather not touch: there are many subtle conditions here that this patch would
complicate further. E.g. the number of spin-waiting threads vs. the number of free threads is
normally affected by only five factors: total number of threads, non-uniqueness rate in the
current fetchlist, sites' bandwidth, configured delay between requests, and allowed # of
threads/host. This patch adds a sixth factor, one that varies per site, which makes it much
harder to predict how many threads you need to avoid dead-locking all of them.
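
To illustrate why a per-site delay is harder to reason about than a single configured
delay, here is a minimal sketch in Python. It is not Nutch code: the class
HostDelayTracker, its method names, and the host names are all hypothetical, and it
only models how long a thread would have to wait for a given host.

```python
import time
from collections import defaultdict


class HostDelayTracker:
    """Hypothetical helper: tracks when each host may next be fetched."""

    def __init__(self, default_delay, clock=time.monotonic):
        self.default_delay = default_delay      # configured delay between requests (s)
        self.crawl_delay = {}                   # per-host override, e.g. from robots.txt
        self.next_allowed = defaultdict(float)  # host -> earliest next fetch time
        self.clock = clock                      # injectable so the sketch is testable

    def set_crawl_delay(self, host, seconds):
        """Record a site-specific delay that replaces the default for this host."""
        self.crawl_delay[host] = seconds

    def wait_time(self, host):
        """Seconds a thread would have to wait before fetching from this host."""
        return max(0.0, self.next_allowed[host] - self.clock())

    def record_fetch(self, host):
        """Note that a fetch just happened and schedule the next allowed one."""
        delay = self.crawl_delay.get(host, self.default_delay)
        self.next_allowed[host] = self.clock() + delay
```

With one global delay, every host blocks a thread for the same bounded time. Once
set_crawl_delay is in play, a single slow host can hold a waiting thread far longer
than the configured default, so the number of threads stuck waiting depends on the
mix of hosts in the fetchlist.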

I'm not strongly opposed to this change, quite the contrary: this is a useful feature.
It's just that I'm concerned it adds yet another feature to messy code that needs
to be rewritten from scratch.

OTOH, it's a non-intrusive quick hack. If we have to have it now, it's definitely better than
waiting for some distant future when we rewrite the fetcher ... ;)

> support for Crawl-delay in Robots.txt
> -------------------------------------
>
>          Key: NUTCH-293
>          URL: http://issues.apache.org/jira/browse/NUTCH-293
>      Project: Nutch
>         Type: Improvement

>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>     Priority: Critical
>  Attachments: crawlDelayv1.patch
>
> Nutch need support for Crawl-delay defined in robots.txt, it is not a standard but a
> de-facto standard.
> See:
> http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
> Webmasters start blocking nutch since we do not support it.
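
For readers unfamiliar with the directive, the following sketch shows what a
Crawl-delay line looks like and how it can be read. It uses Python's standard
urllib.robotparser module rather than anything from Nutch or the attached patch; the
sample robots.txt content and the function name crawl_delay_for are made up for
illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content containing the non-standard Crawl-delay directive.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def crawl_delay_for(robots_txt, agent):
    """Return the Crawl-delay that applies to `agent`, or None if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent)
```

A crawler that honors the directive would use the returned value, when present, in
place of its configured default delay for that host.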

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

