nutch-dev mailing list archives

From "Adrian Newby (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-1836) Timeouts in protocol-httpclient when crawling same host with >2 threads NUTCH-1613 is not a complete solution
Date Sun, 07 Sep 2014 18:08:28 GMT
Adrian Newby created NUTCH-1836:
-----------------------------------

             Summary: Timeouts in protocol-httpclient when crawling same host with >2 threads
NUTCH-1613 is not a complete solution
                 Key: NUTCH-1836
                 URL: https://issues.apache.org/jira/browse/NUTCH-1836
             Project: Nutch
          Issue Type: Improvement
          Components: protocol
    Affects Versions: 1.9
            Reporter: Adrian Newby
            Priority: Minor


NUTCH-1613 fixed the hardcoded limit of 2 threads in protocol-httpclient.
However, just extending that limit to the hardwired maximum of 10 threads and allocating
them all to a single host is only a partial solution.  The pool can still be exhausted,
and timeouts observed, depending on the settings of:

 - fetcher.threads.per.host (nutch-site.xml)
 - mapred.tasktracker.map.tasks.maximum (mapred-site.xml)

It would perhaps be more robust to derive the httpclient connection pool size from these
two configuration parameters, as below:



{code}
    params.setMaxTotalConnections(maxThreadsTotal);

    // Add the following lines ...

    // --------------------------------------------------------------------------------
    // Modification to increase the number of available connections for
    // multi-threaded crawls.
    // --------------------------------------------------------------------------------
    connectionManager.setMaxConnectionsPerHost(conf.getInt("fetcher.threads.per.host", 10));
    connectionManager.setMaxTotalConnections(
        conf.getInt("mapred.tasktracker.map.tasks.maximum", 5)
            * conf.getInt("fetcher.threads.per.host", 10));
    LOG.debug("setMaxConnectionsPerHost: " + connectionManager.getMaxConnectionsPerHost());
    LOG.debug("setMaxTotalConnections  : " + connectionManager.getMaxTotalConnections());
    // --------------------------------------------------------------------------------
{code}
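
To make the arithmetic in the proposal above easier to check, here is a minimal
standalone sketch of the same sizing rule. It is not part of the patch: the class
name ConnectionPoolSizing and its helper methods are hypothetical, and a plain
Map stands in for the Hadoop Configuration object so that the example runs
without Nutch or Hadoop on the classpath. The property names and the fallback
values (10 and 5) are taken from the snippet above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper mirroring the sizing rule proposed in this issue. */
final class ConnectionPoolSizing {
    // Fallback values copied from the proposed snippet.
    static final int DEFAULT_THREADS_PER_HOST = 10;
    static final int DEFAULT_MAP_TASKS = 5;

    /** Stand-in for Configuration.getInt: parse the value or use the default. */
    static int getInt(Map<String, String> conf, String key, int defaultValue) {
        String value = conf.get(key);
        return value == null ? defaultValue : Integer.parseInt(value.trim());
    }

    /** Per-host connection limit: one connection per fetcher thread on a host. */
    static int maxConnectionsPerHost(Map<String, String> conf) {
        return getInt(conf, "fetcher.threads.per.host", DEFAULT_THREADS_PER_HOST);
    }

    /** Total connection limit: map task slots multiplied by threads per host. */
    static int maxTotalConnections(Map<String, String> conf) {
        int mapTasks = getInt(conf, "mapred.tasktracker.map.tasks.maximum", DEFAULT_MAP_TASKS);
        // multiplyExact throws on overflow instead of silently wrapping.
        return Math.multiplyExact(mapTasks, maxConnectionsPerHost(conf));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("fetcher.threads.per.host", "20");
        conf.put("mapred.tasktracker.map.tasks.maximum", "4");
        System.out.println("max connections per host: " + maxConnectionsPerHost(conf));
        System.out.println("max total connections   : " + maxTotalConnections(conf));
    }
}
```

With neither property set, the rule falls back to 10 connections per host and
5 * 10 = 50 connections in total; with the values used in main, it yields 20
and 80.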



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
