flink-user mailing list archives

From Till Rohrmann <trohrm...@apache.org>
Subject Re: Task Manager fault tolerance does not work
Date Tue, 03 Apr 2018 13:42:35 GMT
There is a JIRA issue for the problem:
https://issues.apache.org/jira/browse/FLINK-9120
I am mirroring my response from the ticket in this thread:

The logs (attached to the JIRA ticket) show that the JM had not yet
recognized the killed TM as dead when it tried to restart the job. Thus, it
tried to re-deploy tasks to that machine. When it finally realized that the
TM had been killed, it failed the job. At this point, it would normally try
to recover the job. However, since the restart attempts (set to 3) had
already been used up, it failed the job terminally. Please try raising the
number of restart attempts. This should hopefully fix your problem.
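
For illustration, here is a minimal sketch of raising the limit, assuming
the restart strategy is configured cluster-wide in flink-conf.yaml (the
thread does not say whether it was set there or in the job code). The
values 10 and 10 s are placeholders, not numbers taken from the ticket:

```yaml
# flink-conf.yaml (illustrative values, chosen as an assumption)
restart-strategy: fixed-delay
# Raised from 3 so that retries are not used up before the JM notices the dead TM
restart-strategy.fixed-delay.attempts: 10
# Pause between two consecutive restart attempts
restart-strategy.fixed-delay.delay: 10 s
```

The idea is that attempts multiplied by delay should comfortably exceed
the time the JM needs to detect the lost TM. If the job sets a restart
strategy in its own code, that setting takes precedence over
flink-conf.yaml and would have to be changed there instead.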


On Tue, Apr 3, 2018 at 3:26 PM, Timo Walther <twalthr@apache.org> wrote:

> @Till: Do you have any advice for this issue?
> On 03.04.18 at 11:54, dhirajpraj wrote:
>> What I have found is that the TM fault tolerance behaviour is not
>> consistent. Sometimes it works and sometimes it doesn't. I am attaching
>> my Java code file (which is the main class).
>> What I did was:
>> 1) Run a cluster with the JM on machine A, one TM on machine B and one
>> TM on machine C.
>> 2) Submit a job to the cluster. Works fine till now.
>> 3) Forcefully kill the TM on machine C. The web UI shows the job failing,
>> then restarting, and finally the job is up on its own. This is perfect.
>> 4) Now I start the TM on machine C and wait for sufficient time.
>> 5) Now kill the TM on machine B. At this point the job fails. Shouldn't
>> the job be handled by the running TM on machine C?
>>
>> FlinkPatternDetection.java
>> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1400/FlinkPatternDetection.java>
>> --
>> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
