manifoldcf-user mailing list archives

From Karl Wright <>
Subject Re: Timeout problems with web crawling
Date Tue, 23 Apr 2013 11:00:05 GMT
I take back the "no exceptions" comment.  We are getting one in the
testhost log:

 INFO 2013-04-22 17:39:39,387 (Worker thread '27') - WEB: FETCH
Read timed out
 WARN 2013-04-22 17:39:39,387 (Worker thread '27') - Pre-ingest
service interruption reported for job 1360671306324 connection
'web_crawler': Timed out waiting for IO for
'': Read timed

It really does seem to be a socket timeout.  It looks like it was able
to establish a connection, but then waited 5 minutes for any data to
appear.  Can you fetch this URL without problems using the same headers
- especially the User-Agent header?  It may be that your crawler is
being blocked by this site.
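A quick way to see that this failure mode is a *read* timeout rather than a connection failure is to reproduce it locally. The sketch below is a hypothetical Python illustration, not part of ManifoldCF: it starts a throwaway server that accepts the connection but never answers, and the "test-crawler" User-Agent string is only a placeholder for whatever headers the crawler actually sends.

```python
import http.client
import socket
import socketserver
import threading

# A handler that accepts the connection but never responds, mimicking a
# site that completes the TCP handshake and then stalls.
class StallHandler(socketserver.BaseRequestHandler):
    def handle(self):
        self.request.recv(1024)       # read the request, then go silent
        threading.Event().wait(3)     # stall longer than the client timeout

class StallServer(socketserver.ThreadingTCPServer):
    daemon_threads = True
    allow_reuse_address = True

server = StallServer(("127.0.0.1", 0), StallHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

# A 1-second read timeout stands in for the crawler's 5-minute one; the
# User-Agent value is a placeholder, not ManifoldCF's real one.
conn = http.client.HTTPConnection(host, port, timeout=1)
try:
    conn.request("GET", "/", headers={"User-Agent": "test-crawler"})
    conn.getresponse()
    result = "fetched"
except socket.timeout:
    result = "read timed out"         # connect succeeded, the read did not
finally:
    conn.close()
    server.shutdown()

print(result)
```

The connection itself succeeds, so the failure only surfaces once the client waits for response data - the same shape as the "Read timed out" FETCH entry in the log above.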


On Tue, Apr 23, 2013 at 6:50 AM, Karl Wright <> wrote:

> The solr indexing seems to be working fine on the test host.  I haven't
> verified that is true on the production host.  The cause of the production
> host hanging, though, may be the really awful stuffer query plan.  It seems
> to hang but in fact just gets very very slow.
> Can you dump the postgresql schema that is in place on the production
> machine?  Specifically, I want to see the jobqueue table's indexes.
> I do not see any exceptions at all logged in either place.  If there's a
> service interruption, a warning log entry is usually dumped.  I'm not
> seeing that, though.
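For reference, one way to pull just the jobqueue table's indexes is sketched below; "manifoldcf" is a placeholder database name, and your connection options will differ.

```shell
# Dump only the jobqueue table definition, including its indexes:
pg_dump --schema-only --table=jobqueue manifoldcf

# Or list the indexes directly via the pg_indexes system view:
psql manifoldcf -c \
  "SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'jobqueue';"
```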
> On Tue, Apr 23, 2013 at 6:22 AM, Erlend GarĂ¥sen <>wrote:
>> I'm still having problems with web crawling using trunk with the updated
>> HttpClient. It seems that the problems occur when Solr is password
>> protected, even though the error messages in my logs indicate a timeout
>> problem. I'm not 100% sure, but it seems that the problem starts as soon
>> as I enable password protection.
>> We have struggled a lot with the web crawler in production recently,
>> but I thought we had managed to get around these problems when "Expect:
>> 100-continue" was added to the headers (now added in trunk). Then we
>> discovered a Resin bug which sent back a wrong HTTP status code when this
>> header was enabled, but that has been solved by moving the authentication
>> configuration to the Apache HTTP server instead (using .htaccess). So
>> everything *should* work, but it doesn't. Now I have managed to reproduce
>> the problems on our test server as well by adding full password
>> protection for the Solr test server. As I wrote above, the logs do not
>> seem to report problems with the Solr server, but with the crawled
>> resources instead.
>> I have added two logs. One from the production server, and another from
>> the test server. Log level is set to DEBUG for HttpClient. The prod job
>> just stops and hangs, maybe due to a db lock. The test stops with the
>> message "Error: Repeated service interruptions - failure processing
>> document: null" ("read timed out" in simple history).
>> The logs are available here:
>> Erlend
>> --
>> Erlend GarĂ¥sen
>> Center for Information Technology Services
>> University of Oslo
>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP:
>> 31050
