flink-issues mailing list archives

From StephanEwen <...@git.apache.org>
Subject [GitHub] flink issue #2696: [FLINK-3347] [akka] Add QuarantineMonitor which shuts a q...
Date Thu, 27 Oct 2016 12:02:51 GMT
Github user StephanEwen commented on the issue:

    That's a very good addition, we need something like that.
    After an offline discussion with @tillrohrmann we came to the following conclusion:
    There is a tricky problem with the pure approach: when the JobManager fails, all TaskManagers
will "quarantine" that JobManager's actor system once they detect the failure. That means
they exit and restart. Effectively, a JobManager failure results in all TaskManagers restarting.
    That is a bit heavy.
    Instead, we'll adjust this to do the following:
      - TaskManagers must not watch the JobManager via Akka. That way, JobManager failures
do not cause any quarantining on the TaskManager side.
      - The JobManager keeps watching the TaskManagers via Akka, so TaskManager failures (false
positives) still result in TaskManager quarantine, which means the TaskManager needs to restart
when a TM-JM link breaks.
    How do TaskManagers then detect JobManager failure?
      - TaskManagers send heartbeats to the JobManager anyway (for accumulators and, in the future,
task status reconciliation). The TaskManagers use those heartbeats to detect JobManager failures.
      - In high-availability setups, TaskManagers also notice JobManager failure via ZooKeeper.
      - In addition, in FLIP-6 the resource manager tells TaskManagers about JobManager container
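
As a rough illustration of the heartbeat-based detection described above, here is a minimal sketch in plain Java. It is not Flink code: the class name HeartbeatFailureDetector, its methods, and the timeout value are all hypothetical. It also assumes that the JobManager acknowledges each heartbeat, which the comment does not state. The idea is only that the TaskManager records the last time a heartbeat was acknowledged and suspects the JobManager once that timestamp is older than a timeout, without using Akka death watch on the JobManager.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of Flink): lets a TaskManager suspect a
 * JobManager failure from missing heartbeat acknowledgements alone.
 */
final class HeartbeatFailureDetector {
    private final long timeoutMillis;
    private final LongSupplier clock;   // injected so the logic is testable
    private long lastAckMillis;

    HeartbeatFailureDetector(long timeoutMillis, LongSupplier clock) {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
        // Treat construction time as the first successful contact.
        this.lastAckMillis = clock.getAsLong();
    }

    /** Assumption: called whenever the JobManager acknowledges a heartbeat. */
    void onHeartbeatAcknowledged() {
        lastAckMillis = clock.getAsLong();
    }

    /** True once no acknowledgement has arrived for longer than the timeout. */
    boolean isJobManagerSuspectedDead() {
        return clock.getAsLong() - lastAckMillis > timeoutMillis;
    }
}
```

A TaskManager built this way could check isJobManagerSuspectedDead() on each heartbeat tick and, if it returns true, try to reconnect instead of exiting. In a real deployment the clock would typically be something like System::currentTimeMillis or a monotonic equivalent.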
