flink-issues mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] tillrohrmann opened a new pull request #6587: [FLINK-10011] Release JobGraph from SubmittedJobGraphStore
Date Mon, 20 Aug 2018 22:47:50 GMT
tillrohrmann opened a new pull request #6587: [FLINK-10011] Release JobGraph from SubmittedJobGraphStore
URL: https://github.com/apache/flink/pull/6587
 
 
   ## What is the purpose of the change
   
   This PR fixes the problem that `JobGraphs` sometimes cannot be removed from the
   `ZooKeeperSubmittedJobGraphStore` because a former leader still holds a lock on the `JobGraph`.
   This usually happens in setups with multiple stand-by JobManagers/Dispatchers, where a leader
   loses leadership due to a temporary network glitch but restores its connection to ZooKeeper.
   The lock nodes, which are ephemeral and are created to protect against concurrent deletions,
   are not deleted in this case, so the new leader cannot remove the `JobGraph`.
   
   This PR solves the problem by explicitly removing all locks that a `JobManager`/`Dispatcher`
   holds on the stored `JobGraphs` when it loses leadership.
   
   This PR is based on #6586 
   
   ## Brief change log
   
   SubmittedJobGraphStore#releaseJobGraph removes a potentially existing lock
   from the specified JobGraph. This allows other SubmittedJobGraphStores to
   remove the JobGraph, given that it is no longer locked.
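The lock semantics described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class `LockingJobGraphStoreSketch`, its method signatures, and the use of string owner and job ids are hypothetical and do not mirror Flink's `ZooKeeperSubmittedJobGraphStore`, which keeps its locks as ephemeral ZooKeeper nodes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of the locking behaviour; not Flink code.
class LockingJobGraphStoreSketch {
    // jobId -> owners that currently hold a lock on that job graph
    private final Map<String, Set<String>> locks = new HashMap<>();

    // Storing or recovering a job graph takes a lock for the given owner.
    void putOrRecover(String owner, String jobId) {
        locks.computeIfAbsent(jobId, k -> new HashSet<>()).add(owner);
    }

    // Drops the owner's lock (if any) but keeps the job graph in the store.
    void releaseJobGraph(String owner, String jobId) {
        Set<String> owners = locks.get(jobId);
        if (owners != null) {
            owners.remove(owner);
        }
    }

    // Removal succeeds only if no other owner still holds a lock.
    boolean removeJobGraph(String owner, String jobId) {
        Set<String> owners = locks.get(jobId);
        if (owners == null) {
            return false; // unknown job graph
        }
        for (String other : owners) {
            if (!other.equals(owner)) {
                return false; // still locked by someone else
            }
        }
        locks.remove(jobId);
        return true;
    }

    boolean contains(String jobId) {
        return locks.containsKey(jobId);
    }
}
```

In this model, a new leader's `removeJobGraph` call fails while the former leader's lock is present and succeeds once the former leader has called `releaseJobGraph`.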
   
   The JobManager now releases its lock on all JobGraphs it has stored in
   the SubmittedJobGraphStore if the JobManager loses leadership. This ensures
   that a different JobManager can delete these jobs after it has recovered
   them and they have reached a globally terminal state. This is especially
   important when using stand-by JobManagers, where a former leader might still
   be connected to ZooKeeper and thus still hold all of its ephemeral nodes/locks.
   
   The Dispatcher now releases all JobGraphs it has stored in the SubmittedJobGraphStore
   if it loses leadership. This ensures that the newly elected leader can remove the jobs
   from the SubmittedJobGraphStore after recovering them. Before, the problem was
   that a former leader might still be connected to ZooKeeper, which keeps its ephemeral
   lock nodes alive. This could prevent the deletion of the JobGraph from ZooKeeper.
   The problem occurs in particular in setups with multiple stand-by Dispatchers.
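The release-on-revocation behaviour described above could look roughly like the following. It is a minimal sketch, not the Dispatcher or JobManager code from this PR: the class `LeadershipReleaseSketch`, the nested `Store` interface, and the method names `jobSubmittedOrRecovered` and `revokeLeadership` are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical leader component that releases its job graph locks on revocation.
class LeadershipReleaseSketch {
    // Hypothetical stand-in for the store's release operation.
    interface Store {
        void releaseJobGraph(String jobId) throws Exception;
    }

    private final Store store;
    // Ids of the job graphs this component has stored or recovered (and locked).
    private final Set<String> trackedJobIds = new LinkedHashSet<>();

    LeadershipReleaseSketch(Store store) {
        this.store = store;
    }

    void jobSubmittedOrRecovered(String jobId) {
        trackedJobIds.add(jobId);
    }

    // Called when leadership is lost; returns the ids whose release failed.
    List<String> revokeLeadership() {
        List<String> failed = new ArrayList<>();
        for (String jobId : trackedJobIds) {
            try {
                store.releaseJobGraph(jobId);
            } catch (Exception e) {
                // Keep going so one failure does not leave the other locks behind.
                failed.add(jobId);
            }
        }
        trackedJobIds.clear();
        return failed;
    }
}
```

The sketch attempts every release even if one of them throws, so that a single failing job graph does not block the new leader from removing the others. Whether the actual change handles failures this way is not stated in this message.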
   
   
   ## Verifying this change
   
   - Added `ZooKeeperHAJobManagerTest#testSubmittedJobGraphRelease` and `ZooKeeperHADispatcherTest#testSubmittedJobGraphRelease`
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (no)
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
     - The serializers: (no)
     - The runtime per-record code paths (performance sensitive): (no)
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing,
Yarn/Mesos, ZooKeeper: (yes)
     - The S3 file system connector: (no)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (no)
     - If yes, how is the feature documented? (not applicable)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
