flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Henrik (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-12382) HA + ResourceManager exception: Fencing token not set
Date Wed, 01 May 2019 13:19:00 GMT

     [ https://issues.apache.org/jira/browse/FLINK-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Henrik updated FLINK-12382:
---------------------------
    Description: 
I'm testing zetcd + session jobs in k8s, and testing what happens when I kill both the job-cluster
and task-manager at the same time, but maintain ZK/zetcd up and running.

Then I get this stacktrace, that's completely non-actionable for me, and also resolves itself.
I expect a number of retries, and if this exception is part of the protocol signalling to
retry, then it should not be printed as a log entry.

This might be related to an older bug: [https://jira.apache.org/jira/browse/FLINK-7734]
{code:java}
[tm] 2019-05-01 11:32:01,641 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor           
- Registration at ResourceManager failed due to an error
[tm] java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException:
Fencing token not set: Ignoring message RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e,
RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, HardwareDescription, Time)))
sent to akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing token
is null.
[tm]     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
[tm]     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
[tm]     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
[tm]     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
[tm]     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
[tm]     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
[tm]     at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815)
[tm]     at akka.dispatch.OnComplete.internal(Future.scala:258)
[tm]     at akka.dispatch.OnComplete.internal(Future.scala:256)
[tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
[tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
[tm]     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
[tm]     at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
[tm]     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
[tm]     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
[tm]     at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
[tm]     at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97)
[tm]     at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:982)
[tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
[tm]     at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446)
[tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
[tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
[tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
[tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
[tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
[tm]     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[tm]     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[tm]     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[tm]     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[tm] Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token
not set: Ignoring message RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, RemoteRpcInvocation(registerTaskExecutor(String,
ResourceID, int, HardwareDescription, Time))) sent to akka.tcp://flink@analytics-job:6123/user/resourcemanager
because the fencing token is null.
[tm]     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63)
[tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
[tm]     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
[tm]     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
[tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
[tm]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
[tm]     ... 9 more
[tm] 2019-05-01 11:32:01,650 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           
- Pausing and re-attempting registration in 10000 ms
[tm] 2019-05-01 11:32:03,070 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           
- The heartbeat of JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out.
[tm] 2019-05-01 11:32:03,070 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           
- Close JobManager connection for job 00000000000000000000000000000000.
[tm] 2019-05-01 11:32:03,070 INFO  org.apache.flink.runtime.taskmanager.Task                    
- Attempting to fail task externally Source: Custom Source -> Process -> Timestamps/Watermarks
-> app_events (1/1) (a302013f150f292067cd498100dc6692).
[tm] 2019-05-01 11:32:03,071 INFO  org.apache.flink.runtime.taskmanager.Task                    
- Source: Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) (a302013f150f292067cd498100dc6692)
switched from RUNNING to FAILED.
[tm] org.apache.flink.util.FlinkException: JobManager responsible for 00000000000000000000000000000000
lost the leadership.
[tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1182)
[tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:136)
[tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1624)
[tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
[tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
[tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
[tm]     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
[tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
[tm]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
[tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
[tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
[tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
[tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
[tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
[tm]     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[tm]     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[tm]     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[tm]     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[tm] Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id
3642aa576f132fecd6811ae0d314c2b5 timed out.
[tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
[tm]     ... 15 more
[tm] 2019-05-01 11:32:03,071 INFO  org.apache.flink.runtime.taskmanager.Task                    
- Triggering cancellation of task code Source: Custom Source -> Process -> Timestamps/Watermarks
-> app_events (1/1) (a302013f150f292067cd498100dc6692).
[tm] 2019-05-01 11:32:03,085 INFO  org.apache.flink.runtime.taskmanager.Task                    
- Attempting to fail task externally user_sessions -> (Sink: sink_example_sessions, Filter,
Filter) (1/1) (dbb8434fb24a04b8890520d4e59bbd25).
[tm] 2019-05-01 11:32:03,085 INFO  org.apache.flink.runtime.taskmanager.Task                    
- user_sessions -> (Sink: sink_example_sessions, Filter, Filter) (1/1) (dbb8434fb24a04b8890520d4e59bbd25)
switched from RUNNING to FAILED.
[tm] org.apache.flink.util.FlinkException: JobManager responsible for 00000000000000000000000000000000
lost the leadership.
[tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1182)
[tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:136)
[tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1624)
[tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
[tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
[tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
[tm]     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
[tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
[tm]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
[tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
[tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
[tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
[tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
[tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
[tm]     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[tm]     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[tm]     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[tm]     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[tm] Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id
3642aa576f132fecd6811ae0d314c2b5 timed out.
[tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
[tm]     ... 15 more
{code}
tm stands for taskmanager in this deployment.

EDIT: this also happens if you just temporarily disable network routing; it never recovers
on its own despite having HA configured! In this case, it's the job manager that keeps crashing.
{code:java}
[job] 2019-05-01 13:03:32,299 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler 
- Unhandled exception.
[job] org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set:
Ignoring message LocalFencedMessage(aa2545e3e2ca903b1a0f331235954917, LocalRpcInvocation(requestMultipleJobDetails(Time)))
sent to akka.tcp://flink@analytics-job:6123/user/dispatcher because the fencing token is null.
[job]     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63)
[job]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
[job]     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
[job]     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
[job]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
[job]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
[job]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
[job]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
[job]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
[job]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
[job]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
[job]     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[job]     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[job]     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[job]     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[job] 2019-05-01 13:03:33,308 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler  
- Unhandled exception.{code}
 Then killing both the TM and JM/RM doesn't work. I let it linger for 5 minutes in the broken
state, and then:
{code:java}
//snip
Fatal error occurred while executing the TaskManager. Shutting it down.
// snip
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register
at the ResourceManager within the specified maximum registration duration 300000 ms. This
indicates a problem with this instance. Terminating now.
{code}
 

  was:
I'm testing zetcd + session jobs in k8s, and testing what happens when I kill both the job-cluster
and task-manager at the same time, but maintain ZK/zetcd up and running.

Then I get this stacktrace, that's completely non-actionable for me, and also resolves itself.
I expect a number of retries, and if this exception is part of the protocol signalling to
retry, then it should not be printed as a log entry.

This might be related to an older bug: [https://jira.apache.org/jira/browse/FLINK-7734]
{code:java}
[tm] 2019-05-01 11:32:01,641 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor           
- Registration at ResourceManager failed due to an error
[tm] java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException:
Fencing token not set: Ignoring message RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e,
RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, HardwareDescription, Time)))
sent to akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing token
is null.
[tm]     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
[tm]     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
[tm]     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
[tm]     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
[tm]     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
[tm]     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
[tm]     at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815)
[tm]     at akka.dispatch.OnComplete.internal(Future.scala:258)
[tm]     at akka.dispatch.OnComplete.internal(Future.scala:256)
[tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
[tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
[tm]     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
[tm]     at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
[tm]     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
[tm]     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
[tm]     at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
[tm]     at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97)
[tm]     at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:982)
[tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
[tm]     at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446)
[tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
[tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
[tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
[tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
[tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
[tm]     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[tm]     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[tm]     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[tm]     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[tm] Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token
not set: Ignoring message RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, RemoteRpcInvocation(registerTaskExecutor(String,
ResourceID, int, HardwareDescription, Time))) sent to akka.tcp://flink@analytics-job:6123/user/resourcemanager
because the fencing token is null.
[tm]     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63)
[tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
[tm]     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
[tm]     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
[tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
[tm]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
[tm]     ... 9 more
[tm] 2019-05-01 11:32:01,650 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           
- Pausing and re-attempting registration in 10000 ms
[tm] 2019-05-01 11:32:03,070 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           
- The heartbeat of JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out.
[tm] 2019-05-01 11:32:03,070 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           
- Close JobManager connection for job 00000000000000000000000000000000.
[tm] 2019-05-01 11:32:03,070 INFO  org.apache.flink.runtime.taskmanager.Task                    
- Attempting to fail task externally Source: Custom Source -> Process -> Timestamps/Watermarks
-> app_events (1/1) (a302013f150f292067cd498100dc6692).
[tm] 2019-05-01 11:32:03,071 INFO  org.apache.flink.runtime.taskmanager.Task                    
- Source: Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) (a302013f150f292067cd498100dc6692)
switched from RUNNING to FAILED.
[tm] org.apache.flink.util.FlinkException: JobManager responsible for 00000000000000000000000000000000
lost the leadership.
[tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1182)
[tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:136)
[tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1624)
[tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
[tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
[tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
[tm]     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
[tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
[tm]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
[tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
[tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
[tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
[tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
[tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
[tm]     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[tm]     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[tm]     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[tm]     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[tm] Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id
3642aa576f132fecd6811ae0d314c2b5 timed out.
[tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
[tm]     ... 15 more
[tm] 2019-05-01 11:32:03,071 INFO  org.apache.flink.runtime.taskmanager.Task                    
- Triggering cancellation of task code Source: Custom Source -> Process -> Timestamps/Watermarks
-> app_events (1/1) (a302013f150f292067cd498100dc6692).
[tm] 2019-05-01 11:32:03,085 INFO  org.apache.flink.runtime.taskmanager.Task                    
- Attempting to fail task externally user_sessions -> (Sink: sink_example_sessions, Filter,
Filter) (1/1) (dbb8434fb24a04b8890520d4e59bbd25).
[tm] 2019-05-01 11:32:03,085 INFO  org.apache.flink.runtime.taskmanager.Task                    
- user_sessions -> (Sink: sink_example_sessions, Filter, Filter) (1/1) (dbb8434fb24a04b8890520d4e59bbd25)
switched from RUNNING to FAILED.
[tm] org.apache.flink.util.FlinkException: JobManager responsible for 00000000000000000000000000000000
lost the leadership.
[tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1182)
[tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:136)
[tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1624)
[tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
[tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
[tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
[tm]     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
[tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
[tm]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
[tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
[tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
[tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
[tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
[tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
[tm]     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[tm]     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[tm]     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[tm]     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[tm] Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id
3642aa576f132fecd6811ae0d314c2b5 timed out.
[tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
[tm]     ... 15 more
{code}
tm stands for taskmanager in this deployment.

EDIT: this also happens if you just temporarily disable network routing; it never recovers
on its own despite having HA configured! In this case, it's the job manager that keeps crashing.
{code:java}
[job] 2019-05-01 13:03:32,299 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler 
- Unhandled exception.
[job] org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set:
Ignoring message LocalFencedMessage(aa2545e3e2ca903b1a0f331235954917, LocalRpcInvocation(requestMultipleJobDetails(Time)))
sent to akka.tcp://flink@analytics-job:6123/user/dispatcher because the fencing token is null.
[job]     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63)
[job]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
[job]     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
[job]     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
[job]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
[job]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
[job]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
[job]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
[job]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
[job]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
[job]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
[job]     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[job]     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[job]     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[job]     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[job] 2019-05-01 13:03:33,308 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler  
- Unhandled exception.{code}
 

 


> HA + ResourceManager exception: Fencing token not set
> -----------------------------------------------------
>
>                 Key: FLINK-12382
>                 URL: https://issues.apache.org/jira/browse/FLINK-12382
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.8.0
>         Environment: Same all all previous bugs filed by myself, today, but this time
with HA with zetcd.
>            Reporter: Henrik
>            Priority: Major
>
> I'm testing zetcd + session jobs in k8s, and testing what happens when I kill both the
job-cluster and task-manager at the same time, but maintain ZK/zetcd up and running.
> Then I get this stacktrace, that's completely non-actionable for me, and also resolves
itself. I expect a number of retries, and if this exception is part of the protocol signalling
to retry, then it should not be printed as a log entry.
> This might be related to an older bug: [https://jira.apache.org/jira/browse/FLINK-7734]
> {code:java}
> [tm] 2019-05-01 11:32:01,641 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor           
- Registration at ResourceManager failed due to an error
> [tm] java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException:
Fencing token not set: Ignoring message RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e,
RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, HardwareDescription, Time)))
sent to akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing token
is null.
> [tm]     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> [tm]     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> [tm]     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
> [tm]     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> [tm]     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> [tm]     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> [tm]     at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815)
> [tm]     at akka.dispatch.OnComplete.internal(Future.scala:258)
> [tm]     at akka.dispatch.OnComplete.internal(Future.scala:256)
> [tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
> [tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
> [tm]     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> [tm]     at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
> [tm]     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> [tm]     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> [tm]     at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
> [tm]     at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97)
> [tm]     at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:982)
> [tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [tm]     at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446)
> [tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> [tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> [tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> [tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> [tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> [tm]     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [tm]     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [tm]     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [tm]     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [tm] Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing
token not set: Ignoring message RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, RemoteRpcInvocation(registerTaskExecutor(String,
ResourceID, int, HardwareDescription, Time))) sent to akka.tcp://flink@analytics-job:6123/user/resourcemanager
because the fencing token is null.
> [tm]     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63)
> [tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> [tm]     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
> [tm]     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> [tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [tm]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> [tm]     ... 9 more
> [tm] 2019-05-01 11:32:01,650 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           
- Pausing and re-attempting registration in 10000 ms
> [tm] 2019-05-01 11:32:03,070 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           
- The heartbeat of JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out.
> [tm] 2019-05-01 11:32:03,070 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           
- Close JobManager connection for job 00000000000000000000000000000000.
> [tm] 2019-05-01 11:32:03,070 INFO  org.apache.flink.runtime.taskmanager.Task                    
- Attempting to fail task externally Source: Custom Source -> Process -> Timestamps/Watermarks
-> app_events (1/1) (a302013f150f292067cd498100dc6692).
> [tm] 2019-05-01 11:32:03,071 INFO  org.apache.flink.runtime.taskmanager.Task                    
- Source: Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) (a302013f150f292067cd498100dc6692)
switched from RUNNING to FAILED.
> [tm] org.apache.flink.util.FlinkException: JobManager responsible for 00000000000000000000000000000000
lost the leadership.
> [tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1182)
> [tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:136)
> [tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1624)
> [tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
> [tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
> [tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> [tm]     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> [tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [tm]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> [tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> [tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> [tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> [tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> [tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> [tm]     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [tm]     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [tm]     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [tm]     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [tm] Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with
id 3642aa576f132fecd6811ae0d314c2b5 timed out.
> [tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
> [tm]     ... 15 more
> [tm] 2019-05-01 11:32:03,071 INFO  org.apache.flink.runtime.taskmanager.Task                    
- Triggering cancellation of task code Source: Custom Source -> Process -> Timestamps/Watermarks
-> app_events (1/1) (a302013f150f292067cd498100dc6692).
> [tm] 2019-05-01 11:32:03,085 INFO  org.apache.flink.runtime.taskmanager.Task                    
- Attempting to fail task externally user_sessions -> (Sink: sink_example_sessions, Filter,
Filter) (1/1) (dbb8434fb24a04b8890520d4e59bbd25).
> [tm] 2019-05-01 11:32:03,085 INFO  org.apache.flink.runtime.taskmanager.Task                    
- user_sessions -> (Sink: sink_example_sessions, Filter, Filter) (1/1) (dbb8434fb24a04b8890520d4e59bbd25)
switched from RUNNING to FAILED.
> [tm] org.apache.flink.util.FlinkException: JobManager responsible for 00000000000000000000000000000000
lost the leadership.
> [tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1182)
> [tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:136)
> [tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1624)
> [tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
> [tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
> [tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> [tm]     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> [tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [tm]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> [tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> [tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> [tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> [tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> [tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> [tm]     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [tm]     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [tm]     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [tm]     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [tm] Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with
id 3642aa576f132fecd6811ae0d314c2b5 timed out.
> [tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
> [tm]     ... 15 more
> {code}
> tm stands for taskmanager in this deployment.
> EDIT: this also happens if you just temporarily disable network routing; it never recovers
on its own despite having HA configured! In this case, it's the job manager that keeps crashing.
> {code:java}
> [job] 2019-05-01 13:03:32,299 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler 
- Unhandled exception.
> [job] org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not
set: Ignoring message LocalFencedMessage(aa2545e3e2ca903b1a0f331235954917, LocalRpcInvocation(requestMultipleJobDetails(Time)))
sent to akka.tcp://flink@analytics-job:6123/user/dispatcher because the fencing token is null.
> [job]     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63)
> [job]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> [job]     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
> [job]     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> [job]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [job]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> [job]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> [job]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> [job]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> [job]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> [job]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> [job]     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [job]     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [job]     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [job]     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [job] 2019-05-01 13:03:33,308 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler  
- Unhandled exception.{code}
>  Then killing both the TM and JM/RM doesn't work. I let it linger for 5 minutes in the
broken state, and then:
> {code:java}
> //snip
> Fatal error occurred while executing the TaskManager. Shutting it down.
> // snip
> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could
not register at the ResourceManager within the specified maximum registration duration 300000
ms. This indicates a problem with this instance. Terminating now.
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message