flink-issues mailing list archives

From "Robert Metzger (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-4142) Recovery problem in HA on Hadoop Yarn 2.4.1
Date Fri, 15 Jul 2016 09:13:20 GMT

    [ https://issues.apache.org/jira/browse/FLINK-4142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15379079#comment-15379079 ]

Robert Metzger commented on FLINK-4142:
---------------------------------------

Thank you for posting a log as well.

It seems to be a YARN-specific issue:
{code}
2016-07-01 15:45:03,452 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Launching TaskManager in container ContainerInLaunch @ 1467387903451: Container: [ContainerId: container_1467387670862_0001_02_000002, NodeId: hadoop-srichter-worker-3-vm.c.astral-sorter-757.internal:40436, NodeHttpAddress: hadoop-srichter-worker-3-vm.c.astral-sorter-757.internal:8042, Resource: <memory:4096, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.240.0.18:40436 }, ] on host hadoop-srichter-worker-3-vm.c.astral-sorter-757.internal
2016-07-01 15:45:03,455 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy - Opening proxy : hadoop-srichter-worker-3-vm.c.astral-sorter-757.internal:40436
2016-07-01 15:45:03,508 ERROR org.apache.flink.yarn.YarnFlinkResourceManager - Could not start TaskManager in container ContainerInLaunch @ 1467387903451: Container: [ContainerId: container_1467387670862_0001_02_000002, NodeId: hadoop-srichter-worker-3-vm.c.astral-sorter-757.internal:40436, NodeHttpAddress: hadoop-srichter-worker-3-vm.c.astral-sorter-757.internal:8042, Resource: <memory:4096, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.240.0.18:40436 }, ]
org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.

NMToken for application attempt : appattempt_1467387670862_0001_000001 was used for starting container with container token issued for application attempt : appattempt_1467387670862_0001_000002
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
	at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
	at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:206)
	at org.apache.flink.yarn.YarnFlinkResourceManager.containersAllocated(YarnFlinkResourceManager.java:403)
	at org.apache.flink.yarn.YarnFlinkResourceManager.handleMessage(YarnFlinkResourceManager.java:164)
	at org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:90)
	at org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:70)
	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
	at akka.dispatch.Mailbox.run(Mailbox.scala:221)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2016-07-01 15:45:03,508 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Requesting new TaskManager container with 4096 megabytes memory. Pending requests: 1
2016-07-01 15:45:03,959 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Container ResourceID{resourceId='container_1467387670862_0001_02_000002'} completed successfully with diagnostics: Container released by application
{code}

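The exception shows the cause: after the application master restart, the NMToken of the first application attempt (appattempt_1467387670862_0001_000001) is used to start a container whose container token was issued for the second attempt (appattempt_1467387670862_0001_000002), so the request is rejected as unauthorized.

This is a major bug in Hadoop 2.4.0 that has been fixed in Hadoop 2.5.0:
https://issues.apache.org/jira/browse/YARN-2065

I'll add a warning to the YARN documentation page that there are issues with HA on YARN versions older than 2.5.0.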
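
For illustration only, a check for the affected versions could look roughly like the sketch below. It is not part of Flink: the class name YarnHaVersionCheck and the method isAffected are hypothetical, and the 2.5.0 threshold is taken from this comment. The version string is passed in as an argument so the sketch runs without Hadoop on the classpath; in a real setup it could presumably be obtained from org.apache.hadoop.util.VersionInfo.getVersion().

```java
/**
 * Hypothetical helper (not part of Flink) that flags Hadoop versions
 * older than 2.5.0, the release named above as containing the fix.
 */
public final class YarnHaVersionCheck {

    // Threshold assumed from the comment: fixed in Hadoop 2.5.0.
    static final int FIXED_MAJOR = 2;
    static final int FIXED_MINOR = 5;

    /**
     * Returns true if the given version (e.g. "2.4.1") is older than 2.5.0.
     * Only the leading major and minor components are inspected.
     */
    static boolean isAffected(String hadoopVersion) {
        String[] parts = hadoopVersion.trim().split("\\.");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Unexpected version: " + hadoopVersion);
        }
        int major = Integer.parseInt(parts[0]);
        // Drop suffixes such as "-SNAPSHOT" that may follow the minor number.
        int minor = Integer.parseInt(parts[1].replaceAll("[^0-9].*$", ""));
        return major < FIXED_MAJOR || (major == FIXED_MAJOR && minor < FIXED_MINOR);
    }

    public static void main(String[] args) {
        // Default to the version from the issue report when no argument is given.
        String version = args.length > 0 ? args[0] : "2.4.1";
        if (isAffected(version)) {
            System.err.println("WARNING: Hadoop " + version
                    + " is older than 2.5.0; HA recovery on YARN may fail (see YARN-2065).");
        } else {
            System.out.println("Hadoop " + version + " is not in the affected range.");
        }
    }
}
```

Comparing only the major and minor components is enough here because the threshold is a x.y.0 release.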

> Recovery problem in HA on Hadoop Yarn 2.4.1
> -------------------------------------------
>
>                 Key: FLINK-4142
>                 URL: https://issues.apache.org/jira/browse/FLINK-4142
>             Project: Flink
>          Issue Type: Bug
>          Components: YARN Client
>    Affects Versions: 1.0.3
>            Reporter: Stefan Richter
>
> On Hadoop Yarn 2.4.1, recovery in HA fails in the following scenario:
> 1) Kill application master, let it recover normally.
> 2) After that, kill a task manager.
> Now, Yarn tries to restart the killed task manager in an endless loop. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
