There is a JIRA issue for the problem: https://issues.apache.org/jira/browse/FLINK-9120
. Mirroring my response to this thread:
The logs (attached to the JIRA ticket) show that the JM did not yet recognize the killed TM as killed when trying to restart. Thus, it tries to re-deploy tasks to this machine. When it finally realizes that the TM has been killed, it fails the jobs. At this point, it would try to recover the job. However, since the number of restart attempts are depleted (set to 3), it will fail the job terminally. Please try to raise the number of retry attempts. This should hopefully fix your problem.