manifoldcf-user mailing list archives

From Karl Wright <>
Subject Re: [ManifoldCF 0.5] The web crawler remains running after a network connection refused
Date Wed, 09 May 2012 06:46:11 GMT

ManifoldCF's web connector is, in general, very cautious about not
offending the owners of sites.  If it concludes that the site has
blocked access to a URL, it may remove the URL from its queue for
politeness, which would prevent further crawling of that URL for the
duration of the current job.  In most cases, however, if a URL is
temporarily unavailable, it will be requeued for crawling at a later
time.  The typical pattern is to attempt to recrawl the URL
periodically (e.g. every 5 minutes) for many hours before giving up on
it.  For web crawling, no single URL failure will cause the job to
abort; it will continue running until all the other URLs have been
processed or forever (if the job is continuous).
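The requeue policy described above can be sketched roughly as follows. This is an illustrative sketch, not ManifoldCF's actual code; the 5-minute interval matches the example above, while the 12-hour maximum retry window is an assumed figure for demonstration.

```java
// Illustrative sketch of a fixed-interval retry policy for a transiently
// failing URL: requeue it every RETRY_INTERVAL_MS until MAX_RETRY_WINDOW_MS
// has elapsed since the first failure, then give up on the document.
public class RetryScheduler {
    static final long RETRY_INTERVAL_MS = 5L * 60L * 1000L;          // 5 minutes
    static final long MAX_RETRY_WINDOW_MS = 12L * 60L * 60L * 1000L; // 12 hours (assumed)

    /** Return the next recheck time in ms, or -1 if the URL should be dropped. */
    public static long nextRetryTime(long firstFailureTime, long now) {
        if (now - firstFailureTime >= MAX_RETRY_WINDOW_MS)
            return -1L;                    // retried long enough; give up
        return now + RETRY_INTERVAL_MS;    // requeue for another attempt
    }

    public static void main(String[] args) {
        long firstFailure = 0L;
        // One hour after the first failure: still inside the window, retry in 5 min.
        System.out.println(nextRetryTime(firstFailure, 60L * 60L * 1000L));
        // Thirteen hours after the first failure: past the window, give up.
        System.out.println(nextRetryTime(firstFailure, 13L * 60L * 60L * 1000L));
    }
}
```

The key point for the question below is the first branch: until the retry window is exhausted, a refused connection only reschedules the document, so the job as a whole keeps running.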

You can check on the status of an individual URL by using the Document
Status report.  This report should tell you what ManifoldCF intends to
do with a specific document.  If you locate one such URL and try out
this report, what does it say?


On Tue, May 8, 2012 at 10:04 PM, 小林 茂樹 (Information Systems Division / Service Planning Department)
<> wrote:
> Hi guys.
> I need some advice on stopping the MCF web crawler from a running state
> when a network connection is refused.
> I use MCF 0.5 with Solr 3.5. I was testing what would happen to the web
> crawler when shutting down the web site that is to be crawled. I checked the
> simple history and saw “Connection refused” with a status code of “-1”, which
> looked fine. But as I waited, the job status never changed and remained
> running. The crawler does not crawl in this situation, but when I brought the
> web site back up, the crawler never started crawling again either.
> At the very least, I want the crawler to stop running when a network
> connection is refused, but I don’t know how. Does anyone have any ideas?
