nutch-dev mailing list archives

From Edward Quick <edwardqu...@hotmail.com>
Subject fetch an amended url
Date Wed, 03 Sep 2008 19:43:39 GMT

Hi

Could someone please point me in the right direction? I have a problem crawling our intranet
because many of the pages return status code 500, as illustrated in the headers below, which
(correctly, I agree) gives httpclient the impression that the GET failed. However, the server
actually redirects the GET by appending "?OpenDocument" to the end of the URL initially requested.

I don't think there's a way around this in the configuration, so I looked at Fetcher.java
and tried to get it to refetch the URL with "?OpenDocument" appended, but my code didn't work.
I can't really figure out how the fetcher works! Could someone please tell me how to get Nutch
to refetch the amended URL if httpclient gets a 500 back?

Thanks,

Ed.


http://planetba.baplc.com/general/aptrix/aptprop.nsf/Content/Europe+%26+Africa+Home%5CLibrary%5C500+EA+LocCodes

GET /general/aptrix/aptprop.nsf/Content/Europe+%26+Africa+Home%5CLibrary%5C500+EA+LocCodes HTTP/1.1
Host: planetba.baplc.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie:
ObSSOCookie=DdKzZ2Ebcglw9MjchanSFA%2FKN0agvrJTAe6PEGDHOXTeEgfmCrvqYCxVBY0qwU24Xb2T6MV3%2BUwrIfNhKVQA97J54%2Fd2%2BjetZjNoC98N4638eJpf3ZDyE50llsTdOAADaNn%2BjqVfeFrvDjJ2agM1Pxo1Y7DGR0yME1P0%2FHcd6XgFaHwEq9CyUvPq5k6mKMr7Vy4oiZS75RRPAJwNTOxoj7cLuwHX%2Fugj2GJ%2F8Jdynj6Ov1rxgeCWqGdm1ltqEma1TkAbKayt8RtilHwZxRmYDRc3tnGlaqauVUZDNVNE3B3L3bQDyfaFWaDHuX3r67CP

HTTP/1.x 500 Internal Server Error
Server: Lotus-Domino
Date: Tue, 02 Sep 2008 21:35:52 GMT
Connection: close
Expires: Tue, 01 Jan 1980 06:00:00 GMT
Content-Type: text/html; charset=US-ASCII
Content-Length: 661
Cache-Control: no-cache


----------------------------------------------------------
http://planetba.baplc.com/general/aptrix/aptprop.nsf/Content/Europe+%26+Africa+Home%5CLibrary%5C500+EA+LocCodes?OpenDocument

GET /general/aptrix/aptprop.nsf/Content/Europe+%26+Africa+Home%5CLibrary%5C500+EA+LocCodes?OpenDocument HTTP/1.1
Host: planetba.baplc.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie:
ObSSOCookie=DdKzZ2Ebcglw9MjchanSFA%2FKN0agvrJTAe6PEGDHOXTeEgfmCrvqYCxVBY0qwU24Xb2T6MV3%2BUwrIfNhKVQA97J54%2Fd2%2BjetZjNoC98N4638eJpf3ZDyE50llsTdOAADaNn%2BjqVfeFrvDjJ2agM1Pxo1Y7DGR0yME1P0%2FHcd6XgFaHwEq9CyUvPq5k6mKMr7Vy4oiZS75RRPAJwNTOxoj7cLuwHX%2Fugj2GJ%2F8Jdynj6Ov1rxgeCWqGdm1ltqEma1TkAbKayt8RtilHwZxRmYDRc3tnGlaqauVUZDNVNE3B3L3bQDyfaFWaDHuX3r67CP

HTTP/1.x 200 OK
Server: Lotus-Domino
Date: Tue, 02 Sep 2008 21:35:52 GMT
Last-Modified: Tue, 02 Sep 2008 21:35:50 GMT
Expires: Tue, 01 Jan 1980 06:00:00 GMT
Content-Type: text/html; charset=ISO-8859-1
Content-Length: 104168
Cache-Control: no-cache
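
The retry asked about above (refetch with "?OpenDocument" appended when the first GET returns 500) can be sketched in plain Java. This is only an illustration of the logic, not Nutch's Fetcher code and not a patch against it. The class name OpenDocumentRetry and the methods amend, httpStatus and resolve are hypothetical names made up for this sketch. It also assumes that a URL which already has a query string should be left alone. Where such logic would hook into Nutch is not shown here.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.function.ToIntFunction;

// Hypothetical helper, not part of Nutch: decides whether a URL should be
// refetched with "?OpenDocument" appended after the first GET returned 500.
class OpenDocumentRetry {

    static final String SUFFIX = "?OpenDocument";

    // Returns the amended URL, or null if the URL already has a query string
    // (assumption: such URLs are not candidates for the retry).
    static String amend(String url) {
        return url.contains("?") ? null : url + SUFFIX;
    }

    // Looks up the HTTP status over the network with the JDK's HttpURLConnection.
    // Returns -1 if the URL is malformed or the connection fails.
    static int httpStatus(String url) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            try {
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        } catch (IOException | IllegalArgumentException e) {
            return -1;
        }
    }

    // Returns the URL to use for the page: the original one, unless it answered
    // 500 and the amended one answered 200.
    static String resolve(String url, ToIntFunction<String> status) {
        if (status.applyAsInt(url) != 500) {
            return url;
        }
        String amended = amend(url);
        if (amended != null && status.applyAsInt(amended) == 200) {
            return amended;
        }
        return url;
    }

    public static void main(String[] args) {
        // Stub standing in for the server behaviour shown in the traces above:
        // 500 for the bare URL, 200 once "?OpenDocument" is appended.
        ToIntFunction<String> stub = u -> u.endsWith(SUFFIX) ? 200 : 500;
        System.out.println(resolve("http://example.invalid/db.nsf/Content/Page", stub));
    }
}
```

The status lookup is passed in as a function so the decision logic can be tried without a live server. Passing OpenDocumentRetry::httpStatus instead of the stub would issue real GET requests.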