nutch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Doug Cook (JIRA)" <>
Subject [jira] Commented: (NUTCH-419) unavailable robots.txt kills fetch
Date Sat, 28 Feb 2009 19:06:12 GMT


Doug Cook commented on NUTCH-419:

I ran into this same problem, and spent some time debugging it. Here's what's going on:

Slow sites, or sites that hang, are causing Fetcher threads to hang for very long periods
of time.

I was running a fetcher with 400 threads, and found it getting slower and slower as the fetch
progressed. A look at the stack  over time showed that over time, more and more threads blocked
in the same place, in MultiThreadedHttpConnectionManager, waiting for a connection, called
via http.getResponse() in getRobotRulesSet(). After hours of running, more than 390 of 400
threads were blocked! 

A little sleuthing revealed what's going on. This happens to be the one place in the code
where we call http.getResponse() directly, instead of via getProtocolOutput, which does its
own connection throttling, so we're relying on MultiThreadedHttpConnectionManager's connection
throttling. The problem was that MultiThreadedHttpConnectionManager was ignoring the connection
timeout, and waiting indefinitely for any other running threads to release a connection to
the same site. If there are any large sites which time out on /robots.txt fetches, threads
will "pile up" over time, since they're not timing out, and the RobotRules cache doesn't cache
timeouts, so every new thread hitting that site will try to fetch /robots.txt, and hang for
an increasing amount of time as it has to wait for the (ever-increasing # of) previous threads
to give up and relinquish the single connection.

We seem to be victims of HttpClient's somewhat byzantine parameter architecture. We do set
the timeout, but we set it on the MultiThreadedHttpConnectionManager, which ignores it. If
I set the parameter on the HttpClient instead, the problem goes away. I haven't looked at
the HttpClient source, but I'm guessing that what happens is that internally, HttpClient is
calling MultiThreadedHttpConnectionManager.getConnectionWithTimeout or some such, thereby
overriding the connection manager's own timeouts. At any rate, the fix is very straightforward,
and a patch follows.

I highly recommend this patch, it hugely sped up my crawl, and I suspect it will do so for
others with similar configurations and at least one slow or hung large-ish site in the crawl.

There are two related issues: one, that the RobotRules cache doesn't keep track of failure
cases, so sites that are down will be accessed many, many times; and two, that simultaneous
/robots.txt accesses to the same site will all try to access the site. I've got a locally-modified
version of getRobotRulesSet() that fixes these two problems as well. It's not perfect, but
it makes the fetch both faster and more polite. I'll try to spiffy that up and include it
in a future patch, if anyone's interested.

> unavailable robots.txt kills fetch
> ----------------------------------
>                 Key: NUTCH-419
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8.1
>         Environment: Fetcher is behind a squid proxy, but I am pretty sure this is irrelevant.

> Nutch in local mode, running on a linux machine with 2GB RAM. 
>            Reporter: Carsten Lehmann
>         Attachments: last_robots.txt_requests_squidlog.txt, nutch-log.txt, squid_access_log_tail1000.txt
> I think there is another robots.txt-related problem which is not
> adressed by NUTCH-344,
> but also results in an aborted fetch.
> I am sure that in my last fetch all 17 fetcher threads died
> while they were waiting for a robots.txt-file to be delivered by a not
> properly responding web server.
> I looked at the squid access log, which is used by all fetch threads.
> It ends with many  HTTP-504-errors ("gateway timeout") caused by a
> certain robots.txt url:
> <....>
> 1166652253.332 899427 TCP_MISS/504 1450 GET
> - DIRECT/ text/html
> 1166652343.350 899664 TCP_MISS/504 1450 GET
> - DIRECT/ text/html
> 1166652353.560 899871 TCP_MISS/504 1450 GET
> - DIRECT/ text/html
> These entries mean that it takes 15 minutes before the request ends
> with a timeout.
> This can be calculated from the squid log, the first column is the
> request  time (in UTC seconds), the second column is the duration of
> the request (in ms):
> 900000/1000/60=15 minutes.
> As far as I understand it, every time a fetch thread tries to get this
> robots.txt-file the thread busy waits for the duration of the request
> (15 minutes).
> If this is right, then all 17 fetcher threads were caught in this trap
> at the time when  fetching was aborted, as there are 17 requests in
> the squid log which did not timeout before the message  "aborting with
> 17 threads" was written to the nutch-logfile.
> Setting fetcher.max.crawl.delay can not help here.
> I see 296 access attempts in total concerning this robots.txt-url in
> the squid log of this crawl, but fetcher.max.crawl.delay is set to 30.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message