I'm still having problems with web crawling using trunk with the updated HttpClient. The problems seem to occur when Solr is password protected, even though the error messages in my logs indicate a timeout. I'm not 100% sure, but the problem appears to start as soon as I enable password protection.
We have struggled a lot with the web crawler in production recently, but I thought we had worked around those problems when "expect 100 continue" was added to the header (now in trunk). We then discovered a Resin bug that returned the wrong HTTP status code when this header was enabled, but we solved that by moving the authentication configuration to the Apache HTTP server instead (using .htaccess). So everything *should* work, but it doesn't. I have now managed to reproduce the problem on our test server as well, after adding full password protection to the Solr test server. As I wrote above, the logs do not seem to report problems with the Solr server, but with the crawled resources instead.
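For reference, the Apache-side setup looks roughly like this. This is a minimal sketch; the realm name and the paths to the protected directory and password file are placeholders, not our actual configuration:

```apache
# .htaccess in the protected Solr location (requires AllowOverride AuthConfig)
AuthType Basic
AuthName "Solr"
AuthUserFile /path/to/.htpasswd
Require valid-user
```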
I have attached two logs: one from the production server and one from the test server. The log level is set to DEBUG for HttpClient. The prod job just stops and hangs, maybe due to a db lock. The test job stops with the message "Error: Repeated service interruptions - failure processing document: null" ("read timed out" in simple history).
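One way to check whether the password-protected Solr endpoint itself handles the handshake correctly, independent of the crawler, is a manual POST with curl. The host, port, and credentials below are hypothetical placeholders; substitute your own. This cannot run standalone since it needs the live server:

```shell
# Send a POST with basic auth and an explicit Expect: 100-continue header,
# mimicking what the crawler's HttpClient sends; -v shows the full exchange,
# including whether the server answers 100 Continue or 401 first.
curl -v -u solradmin:secret \
     -H "Expect: 100-continue" \
     -H "Content-Type: text/xml" \
     --data-binary '<commit/>' \
     http://solr.example.org:8983/solr/update
```

If this returns a clean 200 while the crawler still times out, that would point back at the client side rather than the Apache/Resin configuration.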
The logs are available here:
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050