flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-4717) Naive version of atomic stop signal with savepoint
Date Tue, 11 Oct 2016 15:14:26 GMT

    [ https://issues.apache.org/jira/browse/FLINK-4717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565708#comment-15565708 ]

ASF GitHub Bot commented on FLINK-4717:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2609#discussion_r82814608
  
    --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala ---
    @@ -581,6 +581,62 @@ class JobManager(
               )
           }
     
    +    case CancelJobWithSavepoint(jobId, savepointDirectory) =>
    +      try {
    +        val targetDirectory = if (savepointDirectory != null) {
    +          savepointDirectory
    +        } else {
    +          defaultSavepointDir
    +        }
    +
    +        log.info(s"Trying to cancel job $jobId with savepoint to $targetDirectory")
    +
    +        currentJobs.get(jobId) match {
    +          case Some((executionGraph, _)) =>
    +            // We don't want any checkpoint between the savepoint and cancellation
    +            val coord = executionGraph.getCheckpointCoordinator
    +            coord.stopCheckpointScheduler()
    --- End diff --
    
    I think it's not enough to simply call `stopCheckpointScheduler`. If I'm not mistaken,
    the following could happen: `stopCheckpointScheduler` tries to `cancel` the
    `currentPeriodicTrigger`. Now assume that the `TimerTask` which triggers the next
    checkpoint has just been fired but has not yet executed at the moment we cancel it.
    `stopCheckpointScheduler` then returns without the `TimerTask` having completed, so the
    `TimerTask` can still trigger a checkpoint even though we've stopped the checkpoint
    scheduler.
    
    The way to fix this (admittedly academic) corner case is to filter out outdated
    `TimerTask` calls in the `CheckpointCoordinator` by using a kind of fencing token for
    the trigger checkpoint calls.
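
    To illustrate the fencing token idea, here is a minimal sketch. It is not Flink's actual
    `CheckpointCoordinator`: the class `FencedCheckpointScheduler`, the `currentToken` field and
    the method signatures are hypothetical names chosen for this example. Each `TimerTask`
    remembers the token that was valid when it was created, and stopping the scheduler bumps
    the token, so a task that runs late is filtered out.

```java
import java.util.TimerTask;

/** Hypothetical, simplified stand-in; NOT Flink's actual CheckpointCoordinator. */
class FencedCheckpointScheduler {
    private final Object lock = new Object();
    private long currentToken = 0;        // fencing token, bumped on every start and stop
    private int triggeredCheckpoints = 0; // counts checkpoints that passed the fence

    /** Returns a task bound to the token that is valid at creation time. */
    TimerTask startCheckpointScheduler() {
        synchronized (lock) {
            final long token = ++currentToken;
            return new TimerTask() {
                @Override
                public void run() {
                    triggerCheckpoint(token);
                }
            };
        }
    }

    /** Invalidates every task created before this call by bumping the token. */
    void stopCheckpointScheduler() {
        synchronized (lock) {
            currentToken++;
        }
    }

    /** Triggers a checkpoint only if the caller holds the current token. */
    boolean triggerCheckpoint(long token) {
        synchronized (lock) {
            if (token != currentToken) {
                return false; // outdated TimerTask call, ignore it
            }
            triggeredCheckpoints++;
            return true;
        }
    }

    int getTriggeredCheckpoints() {
        synchronized (lock) {
            return triggeredCheckpoints;
        }
    }

    public static void main(String[] args) {
        FencedCheckpointScheduler scheduler = new FencedCheckpointScheduler();
        TimerTask inFlight = scheduler.startCheckpointScheduler();
        inFlight.run();                      // valid token: checkpoint is triggered
        scheduler.stopCheckpointScheduler(); // bumps the token
        inFlight.run();                      // simulates the late task: filtered out
        System.out.println(scheduler.getTriggeredCheckpoints()); // prints 1
    }
}
```

    The sketch calls `run()` directly instead of going through a real `Timer` so that the late
    execution is deterministic. The check and the token bump share one lock, which is what
    closes the window between cancelling the task and the task running.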


> Naive version of atomic stop signal with savepoint
> --------------------------------------------------
>
>                 Key: FLINK-4717
>                 URL: https://issues.apache.org/jira/browse/FLINK-4717
>             Project: Flink
>          Issue Type: New Feature
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.2.0
>            Reporter: Till Rohrmann
>            Priority: Minor
>             Fix For: 1.2.0
>
>
> As a first step towards atomic stopping with savepoints we should implement a cancel
> command which prior to cancelling takes a savepoint. Additionally, it should turn off the
> periodic checkpointing so that there won't be checkpoints executed between the savepoint and
> the cancel command.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
