nutch-dev mailing list archives

From "Otis Gospodnetic (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-629) Detect slow and timeout servers and drop their URLs
Date Sat, 12 Apr 2008 07:07:04 GMT
Detect slow and timeout servers and drop their URLs
---------------------------------------------------

                 Key: NUTCH-629
                 URL: https://issues.apache.org/jira/browse/NUTCH-629
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
            Reporter: Otis Gospodnetic


Fetch jobs will finish faster if we find a way to prevent servers that are either slow or
time out from slowing down the whole process.

I'll attach a patch that counts per-server exceptions and timeouts and tracks download speed
per server.
Queues/servers that exceed timeout or download thresholds are marked as "tooManyErrors" or
"tooSlow".  Once they are marked as such, all of their subsequent URLs are dropped (i.e. they
are not fetched) and marked GONE.
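
To make the idea concrete, here is a rough sketch of the per-server bookkeeping described
above. It is not the attached patch: the class name ServerHealthTracker, its methods, and the
three thresholds (maxErrors, minBytesPerSec, minSamples) are all hypothetical, and marking
URLs GONE in the crawl DB is not modeled here.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-server tracker; names and thresholds are illustrative only. */
final class ServerHealthTracker {
    enum Verdict { OK, TOO_MANY_ERRORS, TOO_SLOW }

    private static final class Stats {
        int errors;    // exceptions and timeouts seen for this server
        int fetches;   // completed downloads
        long bytes;    // total bytes downloaded
        long millis;   // total download time
        Verdict verdict = Verdict.OK;
    }

    private final Map<String, Stats> byHost = new TreeMap<>();
    private final int maxErrors;          // assumed threshold: errors tolerated per server
    private final double minBytesPerSec;  // assumed threshold: slowest acceptable average speed
    private final int minSamples;         // fetches required before judging speed

    ServerHealthTracker(int maxErrors, double minBytesPerSec, int minSamples) {
        this.maxErrors = maxErrors;
        this.minBytesPerSec = minBytesPerSec;
        this.minSamples = minSamples;
    }

    private Stats stats(String host) {
        return byHost.computeIfAbsent(host, h -> new Stats());
    }

    /** Count one exception or timeout; mark the server once the limit is exceeded. */
    synchronized void recordError(String host) {
        Stats s = stats(host);
        s.errors++;
        if (s.verdict == Verdict.OK && s.errors > maxErrors) {
            s.verdict = Verdict.TOO_MANY_ERRORS;
        }
    }

    /** Record one completed download; mark the server if its average speed is too low. */
    synchronized void recordFetch(String host, long bytes, long millis) {
        Stats s = stats(host);
        s.fetches++;
        s.bytes += bytes;
        s.millis += millis;
        if (s.verdict == Verdict.OK && s.fetches >= minSamples && s.millis > 0) {
            double bytesPerSec = s.bytes * 1000.0 / s.millis;
            if (bytesPerSec < minBytesPerSec) {
                s.verdict = Verdict.TOO_SLOW;
            }
        }
    }

    /** Unknown servers are OK; a mark, once set, is never cleared. */
    synchronized Verdict verdict(String host) {
        Stats s = byHost.get(host);
        return s == null ? Verdict.OK : s.verdict;
    }

    /** True if the remaining URLs for this server should be skipped. */
    synchronized boolean shouldDrop(String host) {
        return verdict(host) != Verdict.OK;
    }

    /** One line per server, sorted by host, for an end-of-task summary. */
    synchronized String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Stats> e : byHost.entrySet()) {
            Stats s = e.getValue();
            sb.append(e.getKey()).append(" errors=").append(s.errors)
              .append(" fetches=").append(s.fetches)
              .append(" bytes=").append(s.bytes)
              .append(" status=").append(s.verdict).append('\n');
        }
        return sb.toString();
    }
}
```

A fetcher thread would call shouldDrop(host) before taking the next URL from a queue, and
recordError or recordFetch after each attempt. The minSamples guard is a design choice in this
sketch so that a single slow response does not condemn a server.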

At the end of the fetch task, stats for each server processed are printed.

Also, I believe the per-host/domain/TLD/etc. DB from NUTCH-628 would be the right place to
add server data collected by this patch.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

