nutch-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1613) Timeouts in protocol-httpclient when crawling same host with >2 threads and added cookie strings for both http protocols
Date Sun, 18 May 2014 03:45:17 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000981#comment-14000981
] 

Hudson commented on NUTCH-1613:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #2630 (See [https://builds.apache.org/job/Nutch-trunk/2630/])
NUTCH-1613 Timeouts in protocol-httpclient when crawling same host with >2 threads (jnioche:
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1593951)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java


> Timeouts in protocol-httpclient when crawling same host with >2 threads and added
cookie strings for both http protocols
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1613
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1613
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 2.2.1
>            Reporter: Brian
>            Priority: Minor
>              Labels: patch
>             Fix For: 2.3, 1.9
>
>         Attachments: NUTCH-1613.patch
>
>
> 1.) When using protocol-httpclient to crawl a single website (one host) with more than 2 fetcher threads, I consistently got timeout errors during fetching, and the affected pages were not fetched. E.g.:
> 2013-07-09 17:57:13,717 WARN  fetcher.FetcherJob - fetch of http://www.... failed with: org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting for connection
> 2013-07-09 17:57:13,718 INFO  fetcher.FetcherJob - fetching http://www.... (queue crawl delay=0ms)
> 2013-07-09 17:57:13,715 ERROR httpclient.Http - Failed with the following error: 
> org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting for connection
> 	at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:497)
> 	at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
> 	at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
> 	at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
> 	at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
> 	at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:95)
> 	at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:174)
> 	at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:133)
> 	at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:518)
> This happens because, by default, the connection pool manager allows only 2 connections per host, so when more than 2 threads fetch from the same host the extra threads tend to time out waiting for a connection. The code previously set the maximum total connections correctly, but not the maximum connections per host.
> 2.) At the same time, I added simple modifications to both protocol-http and protocol-httpclient that allow a cookie string to be specified in the conf file and included in the request headers.
> I use this to crawl site content that requires authentication: for me it is simpler to specify the authentication cookie string than to go through the whole authentication process and specify login info.
> The nutch-site.xml property is the following:
> <property>
>         <name>http.cookie_string</name>
>         <value>XX_AL=authorization_value_goes_here</value>
>         <description>String to use as the cookie value for HTTP requests</description>
> </property>
> Although I use it for authentication, it can be used to specify any single cookie string for the crawl (httpclient does support different cookies for different hosts, but I did not get into that).
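
The per-host limit described in item 1 can be illustrated with a small stand-alone simulation. This is only a sketch under stated assumptions, not the code from NUTCH-1613.patch: it does not use Commons HttpClient at all, and the class name PerHostPoolSketch, the method countTimeouts, and all timing values are hypothetical. A java.util.concurrent.Semaphore stands in for the pool's per-host connection slots, and a failed timed acquire stands in for ConnectionPoolTimeoutException. In Commons HttpClient 3.x the corresponding knob appears to be HttpConnectionManagerParams.setDefaultMaxConnectionsPerHost(int); whether the patch uses exactly that call is not stated in this message.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PerHostPoolSketch {

    /**
     * Simulates `threads` fetcher threads hitting one host at the same time.
     * Each thread waits up to waitMillis for one of maxPerHost slots, then
     * holds the slot for holdMillis (the simulated fetch).
     * Returns how many threads gave up waiting for a slot.
     */
    static int countTimeouts(int maxPerHost, int threads, long holdMillis, long waitMillis)
            throws InterruptedException {
        Semaphore hostSlots = new Semaphore(maxPerHost);   // stand-in for per-host connections
        AtomicInteger timeouts = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);      // release all threads together
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try {
                    start.await();
                    // Stand-in for "Timeout waiting for connection".
                    if (!hostSlots.tryAcquire(waitMillis, TimeUnit.MILLISECONDS)) {
                        timeouts.incrementAndGet();
                        return;
                    }
                    try {
                        Thread.sleep(holdMillis);          // simulated page fetch
                    } finally {
                        hostSlots.release();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        start.countDown();
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return timeouts.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Per-host limit of 2 (the default described above) versus a limit
        // equal to the number of threads.
        System.out.println("timeouts with limit 2:  " + countTimeouts(2, 10, 500, 100));
        System.out.println("timeouts with limit 10: " + countTimeouts(10, 10, 500, 100));
    }
}
```

Because the simulated fetch (500 ms) outlasts the wait (100 ms), the threads that do not get one of the 2 slots should all give up, while raising the per-host limit to the thread count should leave none waiting. That matches the reasoning in the report: the total connection limit alone is not enough when every thread targets the same host.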



--
This message was sent by Atlassian JIRA
(v6.2#6252)
