hadoop-yarn-dev mailing list archives

From "Rahul Anand (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-8855) Application submission fails if one of the subclusters is down.
Date Mon, 08 Oct 2018 11:33:00 GMT
Rahul Anand created YARN-8855:
---------------------------------

             Summary: Application submission fails if one of the subclusters is down.
                 Key: YARN-8855
                 URL: https://issues.apache.org/jira/browse/YARN-8855
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Rahul Anand


If one of the subclusters is down, application submission retries the failover many times
and then fails. About 30 failover attempts appear in the logs. The detailed exception is below.
{code:java}
2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Container container_e03_1538297667953_0005_01_000001
transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE | ContainerImpl.java:2093
2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Removing container_e03_1538297667953_0005_01_000001
from application application_1538297667953_0005 | ApplicationImpl.java:512
2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping resource-monitoring
for container_e03_1538297667953_0005_01_000001 | ContainersMonitorImpl.java:932
2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Considering container container_e03_1538297667953_0005_01_000001
for log-aggregation | AppLogAggregatorImpl.java:538
2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Got event CONTAINER_STOP
for appId application_1538297667953_0005 | AuxServices.java:350
2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping container container_e03_1538297667953_0005_01_000001
| YarnShuffleService.java:295
2018-10-08 14:21:21,247 | WARN | NM Event dispatcher | couldn't find container container_e03_1538297667953_0005_01_000001
while processing FINISH_CONTAINERS event | ContainerManagerImpl.java:1660
2018-10-08 14:21:22,248 | INFO | Node Status Updater | Removed completed containers from NM
context: [container_e03_1538297667953_0005_01_000001] | NodeStatusUpdaterImpl.java:696
2018-10-08 14:21:26,734 | INFO | pool-16-thread-1 | Failing over to the ResourceManager for
SubClusterId: cluster2 | FederationRMFailoverProxyProvider.java:124
2018-10-08 14:21:26,735 | INFO | pool-16-thread-1 | Flushing subClusters from cache and rehydrating
from store, most likely on account of RM failover. | FederationStateStoreFacade.java:258
2018-10-08 14:21:26,738 | INFO | pool-16-thread-1 | Connecting to /192.168.0.25:8032 subClusterId
cluster2 with protocol ApplicationClientProtocol as user root (auth:SIMPLE) | FederationRMFailoverProxyProvider.java:145
2018-10-08 14:21:26,741 | INFO | pool-16-thread-1 | java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64
to node-master1-IYTxR:8032 failed on connection exception: java.net.ConnectException: Connection
refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking
ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 28 failover attempts.
Trying to failover after sleeping for 15261ms. | RetryInvocationHandler.java:411
2018-10-08 14:21:42,002 | INFO | pool-16-thread-1 | Failing over to the ResourceManager for
SubClusterId: cluster2 | FederationRMFailoverProxyProvider.java:124
2018-10-08 14:21:42,003 | INFO | pool-16-thread-1 | Flushing subClusters from cache and rehydrating
from store, most likely on account of RM failover. | FederationStateStoreFacade.java:258
2018-10-08 14:21:42,005 | INFO | pool-16-thread-1 | Connecting to /192.168.0.25:8032 subClusterId
cluster2 with protocol ApplicationClientProtocol as user root (auth:SIMPLE) | FederationRMFailoverProxyProvider.java:145
2018-10-08 14:21:42,007 | INFO | pool-16-thread-1 | java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64
to node-master1-IYTxR:8032 failed on connection exception: java.net.ConnectException: Connection
refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking
ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 29 failover attempts.
Trying to failover after sleeping for 21175ms. | RetryInvocationHandler.java:411
2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Failing over to the ResourceManager for
SubClusterId: cluster2 | FederationRMFailoverProxyProvider.java:124
2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Flushing subClusters from cache and rehydrating
from store, most likely on account of RM failover. | FederationStateStoreFacade.java:258
2018-10-08 14:22:03,186 | INFO | pool-16-thread-1 | Connecting to /192.168.0.25:8032 subClusterId
cluster2 with protocol ApplicationClientProtocol as user root (auth:SIMPLE) | FederationRMFailoverProxyProvider.java:145
2018-10-08 14:22:03,189 | ERROR | pool-16-thread-1 | Failed to register application master:
cluster2 Application: appattempt_1538297667953_0005_000001 | FederationInterceptor.java:1106
java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to node-master1-IYTxR:8032
failed on connection exception: java.net.ConnectException: Connection refused; For more details
see: http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.GeneratedConstructorAccessor59.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:755)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1517)
    at org.apache.hadoop.ipc.Client.call(Client.java:1459)
{code}
cc [~botong] 
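
To make the retry pattern in the log above easier to quantify, here is a minimal sketch that pulls the failover attempt count and back-off duration out of the RetryInvocationHandler messages. It is only an illustration for triage: the class FailoverLogSummary and its methods are hypothetical names, not part of Hadoop, and the regular expression assumes the message wording shown in this report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading the log excerpt in this report; not a Hadoop class.
public class FailoverLogSummary {

    /** One retry message: the attempt count so far and the sleep before the next try. */
    public static final class Attempt {
        public final int failoverAttempts;
        public final long sleepMillis;

        Attempt(int failoverAttempts, long sleepMillis) {
            this.failoverAttempts = failoverAttempts;
            this.sleepMillis = sleepMillis;
        }
    }

    // Assumes the wording "after N failover attempts. Trying to failover after
    // sleeping for Mms" seen in the log. \s+ tolerates the mail archive's line wrapping.
    private static final Pattern RETRY = Pattern.compile(
        "after\\s+(\\d+)\\s+failover\\s+attempts\\.\\s+Trying\\s+to\\s+failover"
            + "\\s+after\\s+sleeping\\s+for\\s+(\\d+)ms");

    /** Returns every retry message found in the log text, in order of appearance. */
    public static List<Attempt> parse(String log) {
        List<Attempt> attempts = new ArrayList<>();
        Matcher m = RETRY.matcher(log);
        while (m.find()) {
            attempts.add(new Attempt(Integer.parseInt(m.group(1)), Long.parseLong(m.group(2))));
        }
        return attempts;
    }

    /** Sums the back-off time across the parsed retry messages. */
    public static long totalSleepMillis(List<Attempt> attempts) {
        long total = 0;
        for (Attempt a : attempts) {
            total += a.sleepMillis;
        }
        return total;
    }

    public static void main(String[] args) {
        // Two messages copied from the log in this report, including the wrapped line break.
        String excerpt =
            "ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 28 failover attempts.\n"
            + "Trying to failover after sleeping for 15261ms. | RetryInvocationHandler.java:411\n"
            + "ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 29 failover attempts.\n"
            + "Trying to failover after sleeping for 21175ms. | RetryInvocationHandler.java:411\n";

        List<Attempt> attempts = parse(excerpt);
        for (Attempt a : attempts) {
            System.out.println(a.failoverAttempts + " failover attempts, next sleep " + a.sleepMillis + " ms");
        }
        System.out.println("total sleep: " + totalSleepMillis(attempts) + " ms");
    }
}
```

Run against the full NodeManager log, a summary like this shows how long the submission path stayed blocked on the unreachable subcluster before the final ERROR was raised.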



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


