nutch-dev mailing list archives

From "Doug Cook (JIRA)" <j...@apache.org>
Subject [jira] Updated: (NUTCH-419) unavailable robots.txt kills fetch
Date Sat, 28 Feb 2009 19:20:12 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cook updated NUTCH-419:
----------------------------

    Attachment: diffs

Here's a context diff. Hopefully this will work; I am rusty at creating patches, and I made
this one outside of my normal development tree, which is highly divergent from the Nutch trunk.

In any case, it's a one-liner, easy enough to add manually ;-)

> unavailable robots.txt kills fetch
> ----------------------------------
>
>                 Key: NUTCH-419
>                 URL: https://issues.apache.org/jira/browse/NUTCH-419
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8.1
>         Environment: Fetcher is behind a squid proxy, but I am pretty sure this is irrelevant.
>                      Nutch in local mode, running on a Linux machine with 2GB RAM.
>            Reporter: Carsten Lehmann
>         Attachments: diffs, last_robots.txt_requests_squidlog.txt, nutch-log.txt, squid_access_log_tail1000.txt
>
>
> I think there is another robots.txt-related problem which is not
> addressed by NUTCH-344,
> but also results in an aborted fetch.
> I am sure that in my last fetch all 17 fetcher threads died
> while they were waiting for a robots.txt-file to be delivered by a not
> properly responding web server.
> I looked at the squid access log, which is used by all fetch threads.
> It ends with many HTTP-504-errors ("gateway timeout") caused by a
> certain robots.txt url:
> <....>
> 1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET
> http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
> 1166652343.350 899664 127.0.0.1 TCP_MISS/504 1450 GET
> http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
> 1166652353.560 899871 127.0.0.1 TCP_MISS/504 1450 GET
> http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
> These entries mean that each request takes 15 minutes before it ends
> with a timeout.
> This can be calculated from the squid log: the first column is the
> request time (in seconds since the Unix epoch), the second column is
> the duration of the request (in ms):
> 900000/1000/60 = 15 minutes.
> As far as I understand it, every time a fetch thread tries to get this
> robots.txt-file the thread busy waits for the duration of the request
> (15 minutes).
> If this is right, then all 17 fetcher threads were caught in this trap
> at the time when fetching was aborted, as there are 17 requests in
> the squid log which had not timed out before the message "aborting with
> 17 threads" was written to the nutch-logfile.
> Setting fetcher.max.crawl.delay cannot help here.
> I see 296 access attempts in total concerning this robots.txt-url in
> the squid log of this crawl, but fetcher.max.crawl.delay is set to 30.
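
The duration arithmetic in the description can be checked with a short script. This is only a sketch: it assumes the entries follow squid's native access.log layout (timestamp, elapsed ms, client, result/status, bytes, method, URL, ...), and the names LOG and parse_line are made up for illustration. The three entries are the ones quoted above, with each wrapped line rejoined.

```python
# Sketch: recompute the request durations from the squid entries quoted above.
# Assumption: whitespace-separated fields in squid's native access.log order.
LOG = """\
1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
1166652343.350 899664 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
1166652353.560 899871 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
"""


def parse_line(line):
    """Return (timestamp_s, elapsed_ms, result_status, url) for one entry."""
    fields = line.split()
    return float(fields[0]), int(fields[1]), fields[3], fields[6]


entries = [parse_line(line) for line in LOG.splitlines() if line.strip()]

# Elapsed time is in milliseconds: ms / 1000 / 60 gives minutes.
minutes = [round(elapsed_ms / 1000 / 60, 1) for _, elapsed_ms, _, _ in entries]

for (_, elapsed_ms, status, url), mins in zip(entries, minutes):
    print(f"{url} {status}: {elapsed_ms} ms = about {mins} minutes")
```

Each of the three requests comes out at roughly 15 minutes, which matches the estimate in the description.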

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

