hadoop-common-issues mailing list archives

From "Uma Maheswara Rao G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-7488) When Namenode network is unplugged, DFSClient operations waits for ever
Date Tue, 16 Aug 2011 14:57:27 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085760#comment-13085760 ]

Uma Maheswara Rao G commented on HADOOP-7488:

Hi Konstantin,

Thanks a lot for taking a look at this issue.

If rpcTimeout > 0 then {{handleTimeout()}} will throw SocketTimeoutException instead of
going into ping loop. Can you control the required behavior by setting rpcTimeout > 0 rather
than introducing the # of pings limit?
 Yes, we can control the behavior with this parameter as well.
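
The decision described in the quoted comment can be illustrated with a minimal sketch. This is a simplified model and not Hadoop's actual IPC client code: the class name TimeoutPolicySketch, the Action enum and onReadTimeout() are hypothetical names introduced only for illustration. It assumes a value of 0 means "no RPC timeout configured".

```java
import java.net.SocketTimeoutException;

/** Simplified model of the read-timeout decision; not Hadoop's actual client code. */
class TimeoutPolicySketch {
    /** Hypothetical outcome of a single socket read timeout. */
    enum Action { SEND_PING, FAIL }

    private final int rpcTimeoutMs;

    TimeoutPolicySketch(int rpcTimeoutMs) {
        this.rpcTimeoutMs = rpcTimeoutMs;
    }

    /** rpcTimeout > 0: report the timeout to the caller; otherwise keep pinging. */
    Action onReadTimeout() {
        return rpcTimeoutMs > 0 ? Action.FAIL : Action.SEND_PING;
    }

    /** Rethrows the timeout when an RPC timeout is set, instead of entering the ping loop. */
    void handleTimeout(SocketTimeoutException e) throws SocketTimeoutException {
        if (onReadTimeout() == Action.FAIL) {
            throw e;
        }
        // Otherwise a real client would send a ping here and wait for the reply again.
    }
}
```

With rpcTimeoutMs set to 0 the caller never sees the timeout, which corresponds to the "waits for ever" behaviour reported in this issue; any positive value turns the first read timeout into an exception the caller can act on.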

 I am planning to add the code below in DataNode when getting the proxy.

        // get NN proxy
      DatanodeProtocol dnp = 
            DatanodeProtocol.versionID, nnAddr, conf, socketTimeout,

  Here socketTimeout is passed as the rpcTimeout.
 This property is already used as the rpcTimeout for createInterDataNodeProtocolProxy.
 this.socketTimeout =  conf.getInt(DFS_CLIENT_SOCKET_TIMEOUT_KEY,

But my question is: if I use socketTimeout (default 60*1000 ms) as the rpcTimeout, the default
behaviour will change. I don't want to change the default behaviour here.
 Any suggestions for this?

DataNodes and TaskTrackers are designed to ping NN and JT infinitely, because during startup
you cannot predict when NN will come online as it depends on the size of the image and edits.
Also when NN becomes busy it is important for DNs to keep retrying rather than assuming the
NN is dead.

Yes. But in some scenarios, such as a network unplug, the calls may go through timeouts, and
because of the timeout handling the system will be blocked unnecessarily for a long time.
As far as I know, even if we throw that timeout exception out to the JT or DN, they will handle
it and retry in their offerService methods,
except in the condition below:
 catch(RemoteException re) {
          String reClass = re.getClassName();
          if (UnregisteredNodeException.class.getName().equals(reClass) ||
              DisallowedDatanodeException.class.getName().equals(reClass) ||
              IncorrectVersionException.class.getName().equals(reClass)) {
            LOG.warn("blockpool " + blockPoolId + " is shutting down", re);
            shouldServiceRun = false;
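
The retry behaviour described above can be sketched as follows. This is an illustrative model only, not the DataNode's offerService code: ServiceLoopSketch, Call, RemoteError and the maxAttempts bound are hypothetical, the exception class names are compared as plain strings, and the loop returns after the first success so that the example terminates.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ServiceLoopSketch {
    /** Hypothetical stand-in for one RPC round trip to the NN. */
    interface Call { void run() throws IOException; }

    /** Hypothetical stand-in for a remote error that carries the server-side class name. */
    static class RemoteError extends IOException {
        final String className;
        RemoteError(String className) { super(className); this.className = className; }
    }

    // Class names taken from the snippet above, used here as simple strings.
    static final Set<String> FATAL = new HashSet<>(Arrays.asList(
        "UnregisteredNodeException", "DisallowedDatanodeException", "IncorrectVersionException"));

    /** Returns the number of attempts made before the first success or a fatal error. */
    static int serve(Call call, int maxAttempts) {
        int attempts = 0;
        boolean shouldServiceRun = true;
        while (shouldServiceRun && attempts < maxAttempts) {
            attempts++;
            try {
                call.run();
                return attempts;
            } catch (RemoteError re) {
                if (FATAL.contains(re.className)) {
                    shouldServiceRun = false; // fatal remote error: stop the service loop
                }
            } catch (SocketTimeoutException e) {
                // Timeout surfaced by the IPC layer: loop around and retry.
            } catch (IOException e) {
                // Any other I/O failure: retry as well.
            }
        }
        return attempts;
    }
}
```

In this model a timeout thrown by the IPC layer costs only one extra loop iteration, while the three fatal remote errors stop the loop immediately.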

And even if they don't this should be an HDFS change not generic IPC change, which affects
many Hadoop components
 What I felt is that this particular issue applies to all the components that use Hadoop
IPC. I also planned to retain the default behaviour as it is, so as not to affect the other
components. If a user really requires it, he can tune the configuration parameter based
on his requirement.

Anyway, we have decided to use rpcTimeout, right? Then only the IPC user code needs to pass this
value. In that case this becomes an HDFS-specific change. We also need to check MapReduce
(the same situation applies to the JT).

As for HA I don't know what you did for HA and therefore cannot understand what problem you
are trying to solve here. I can guess that you want DNs switch to another NN when they timeout
rather than retrying. In this case you should be able to use rpcTimeout
 Yes, your guess is correct :-)
 In our HA solution we are using *BackupNode*, and the switching framework is *Zookeeper based*
LeaderElection. DNs have both the active and standby node addresses configured. On
any failure, DNs will try to switch to the other NN.
 Here the scenario is: we unplugged the network card of the active NN, and then all DNs were
blocked for a long time.
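
The switch-on-failure idea can be sketched as below. This is only an illustration under stated assumptions and not the HA implementation discussed here: NameNodeFailoverSketch, Rpc and callWithFailover are hypothetical names, the addresses in any usage are placeholders, and the Zookeeper-based election that decides which NN is active is left out entirely.

```java
import java.net.SocketTimeoutException;
import java.util.Arrays;
import java.util.List;

class NameNodeFailoverSketch {
    /** Hypothetical stand-in for an RPC issued against a given NN address. */
    interface Rpc { String call(String nnAddr) throws SocketTimeoutException; }

    private final List<String> nnAddrs;
    private int current = 0;

    NameNodeFailoverSketch(String active, String standby) {
        this.nnAddrs = Arrays.asList(active, standby);
    }

    String currentAddr() {
        return nnAddrs.get(current);
    }

    /** Tries the current NN; on a timeout, switches to the other NN and tries once more. */
    String callWithFailover(Rpc rpc) throws SocketTimeoutException {
        try {
            return rpc.call(currentAddr());
        } catch (SocketTimeoutException e) {
            current = (current + 1) % nnAddrs.size(); // switch to the other configured NN
            return rpc.call(currentAddr());
        }
    }
}
```

The switch can only happen if the timeout reaches the caller. When the IPC layer keeps pinging indefinitely, the catch block is never entered, which matches the blocked DNs seen after unplugging the active NN.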


> When Namenode network is unplugged, DFSClient operations waits for ever
> -----------------------------------------------------------------------
>                 Key: HADOOP-7488
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7488
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HADOOP-7488.patch
> When NN/DN is shutdown gracefully, the DFSClient operations which are waiting for a response from NN/DN, will throw exception & come out quickly
> But when the NN/DN network is unplugged, the DFSClient operations which are waiting for a response from NN/DN, waits for ever.


