nutch-dev mailing list archives

From "Chris Schneider (JIRA)" <>
Subject [jira] Commented: (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost
Date Wed, 11 Oct 2006 18:47:36 GMT
Chris Schneider commented on NUTCH-385:

This comment was actually made by Ken Krugler, who was responding to Andrzej's comment above:

[with respect to Andrzej's definitions at the beginning of his comment - Ed.:]
I agree that this is one of two possible interpretations. The other is that there are N "virtual
users", and the crawlDelay applies to each of these virtual users in isolation.

Using the same type of request data from above, I see a queue of requests with the following
durations (in seconds):

4, 9, 6, 5, 6, 4, 7, 4

So with the virtual user model (where N = 2, thus "A" and "B" users), I get:

===0         1         2
A: 4+++ccc6+++++ccc6+++++ccc7++++++
B: 9++++++++ccc5++++ccc4+++ccc4+++

Each digit marks the start of a new request and gives that request's total duration in seconds.

This would seem to be less efficient than your approach, but somehow feels more in the nature
of what a crawl delay really means.

Let's see, for N = 3 this would look like:

===0         1         2
A: 4+++ccc5++++ccc7++++++ccc
B: 9++++++++ccc4+++ccc
C: 6+++++ccc6+++++ccc4+++ccc
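The two timelines above can be reproduced with a short simulation. This is only a sketch in Python, not Nutch code: the function names are hypothetical, and it assumes a crawl delay of 3 seconds (the three `c` characters in the diagrams) and that each queued request goes to whichever virtual user becomes available first.

```python
import heapq

def simulate_virtual_users(durations, n_users, crawl_delay):
    """Assign queued requests to the earliest-available virtual user.

    Returns one list per user of (start_time, duration) pairs.
    """
    # Heap of (time at which the user may start its next request, user index).
    ready = [(0, user) for user in range(n_users)]
    heapq.heapify(ready)
    schedule = [[] for _ in range(n_users)]
    for duration in durations:
        start, user = heapq.heappop(ready)
        schedule[user].append((start, duration))
        # The user is busy for `duration`, then waits out its own crawl delay.
        heapq.heappush(ready, (start + duration + crawl_delay, user))
    return schedule

def render(schedule, crawl_delay):
    """Draw each user's timeline in the notation of the diagrams above.

    Only works for single-digit durations, like the ones in this example.
    """
    lines = []
    for user, requests in enumerate(schedule):
        label = chr(ord("A") + user)
        body = "".join(
            str(d) + "+" * (d - 1) + "c" * crawl_delay for _, d in requests
        )
        lines.append(f"{label}: {body}")
    return lines

if __name__ == "__main__":
    queue = [4, 9, 6, 5, 6, 4, 7, 4]
    for line in render(simulate_virtual_users(queue, 3, 3), 3):
        print(line)
```

With N = 3 this prints the same three rows as the diagram above. With N = 2 each user gets the same requests as in the first diagram, although the rendering also draws the trailing delay after the last request.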


To implement the virtual users model, each unique domain being actively fetched from would
need to have N pieces of state, each tracking the completion time of the last request made by
one virtual user.
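As a rough illustration of that per-domain state, here is a minimal Python sketch. The class and method names are hypothetical and nothing here is taken from the Nutch code base; the caller passes the current time in explicitly so the logic is easy to follow.

```python
class HostSlots:
    """State for one domain: the last completion time of each of N virtual users."""

    def __init__(self, n_slots, crawl_delay):
        self.crawl_delay = crawl_delay
        # A number is the time the slot's last request finished;
        # None means the slot is in the middle of a request.
        self.last_done = [float("-inf")] * n_slots

    def try_acquire(self, now):
        """Return a free slot whose crawl delay has elapsed, or None."""
        for slot, done in enumerate(self.last_done):
            if done is not None and now >= done + self.crawl_delay:
                self.last_done[slot] = None
                return slot
        return None

    def release(self, slot, now):
        """Record that the request running in `slot` completed at time `now`."""
        self.last_done[slot] = now


# One HostSlots instance per domain being actively fetched from.
slots_by_host = {}
```

A fetcher thread would call `try_acquire` before each request and `release` afterwards, so each slot waits out its own delay regardless of what the other slots are doing.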

Anyway, just an alternative interpretation...

> Server delay feature conflicts with maxThreadsPerHost
> -----------------------------------------------------
>                 Key: NUTCH-385
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>            Reporter: Chris Schneider
> For some time I've been puzzled by the interaction between two parameters that control
how often the fetcher can access a particular host:
> 1) The server delay, which comes back from the remote server during our processing of
the robots.txt file, and which can be limited by fetcher.max.crawl.delay.
> 2) The value, particularly when this is greater than the default
of 1.
> According to my (limited) understanding of the code in
> Suppose that is 2, and that (by chance) the fetcher ends up
keeping either 1 or 2 fetcher threads pointing at a particular host continuously. In other
words, it never tries to point 3 at the host, and it always points a second thread at the
host before the first thread finishes accessing it. Since HttpBase.unblockAddr never gets
called with (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts System.currentTimeMillis()
+ crawlDelay into BLOCKED_ADDR_TO_TIME for the host. Thus, the server delay will never be
used at all. The fetcher will be continuously retrieving pages from the host, often with 2
fetchers accessing the host simultaneously.
> Suppose instead that the fetcher finally does allow the last thread to complete before
it gets around to pointing another thread at the target host. When the last fetcher thread
calls HttpBase.unblockAddr, it will now put System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME
for the host. This, in turn, will prevent any threads from accessing this host until the delay
is complete, even though zero threads are currently accessing the host.
> I see this behavior as inconsistent. More importantly, the current implementation certainly
doesn't seem to answer my original question about appropriate definitions for what appear
to be conflicting parameters. 
> In a nutshell, how could we possibly honor the server delay if we allow more than one
fetcher thread to simultaneously access the host?
> It would be one thing if whenever ( > 1), this trumped the
server delay, causing the latter to be ignored completely. That is certainly not the case
in the current implementation, as it will wait for server delay whenever the number of threads
accessing a given host drops to zero.
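The inconsistency described in the quoted report can be modeled in a few lines. The following Python sketch is based only on that description, not on the actual HttpBase source: the class, its methods, and the `max_threads_per_host` parameter are hypothetical stand-ins, and the only rule taken from the report is that the delay is recorded when the last active thread unblocks the host.

```python
class HostBlocker:
    """Toy model of the blocking behavior described in the report."""

    def __init__(self, max_threads_per_host, crawl_delay):
        self.max_threads = max_threads_per_host
        self.crawl_delay = crawl_delay
        self.thread_count = {}   # host -> number of threads fetching from it
        self.blocked_until = {}  # host -> earliest time the next fetch may start

    def block(self, host, now):
        """Return True if a thread may start fetching from `host` at `now`."""
        if now < self.blocked_until.get(host, 0):
            return False
        count = self.thread_count.get(host, 0)
        if count >= self.max_threads:
            return False
        self.thread_count[host] = count + 1
        return True

    def unblock(self, host, now):
        """Called when a thread finishes fetching from `host`."""
        count = self.thread_count[host]
        if count == 1:
            # Last thread leaving: only in this case is the delay recorded.
            del self.thread_count[host]
            self.blocked_until[host] = now + self.crawl_delay
        else:
            self.thread_count[host] = count - 1
```

In this model, as long as a second thread always arrives before the first one leaves, `unblock` never sees a count of 1 and the delay is never applied. Once the count does drop to zero, the host is blocked for the full delay.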

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators:
For more information on JIRA, see:

