flink-issues mailing list archives

From "Till Rohrmann (Jira)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-21979) Job can be restarted from the beginning after it reached a terminal state
Date Thu, 20 May 2021 08:02:00 GMT

     [ https://issues.apache.org/jira/browse/FLINK-21979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann updated FLINK-21979:
    Labels:   (was: stale-critical)

> Job can be restarted from the beginning after it reached a terminal state
> -------------------------------------------------------------------------
>                 Key: FLINK-21979
>                 URL: https://issues.apache.org/jira/browse/FLINK-21979
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.3, 1.12.2, 1.13.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.14.0
> Currently, the {{JobMaster}} removes all checkpoints after a job reaches a globally terminal
state. Then it notifies the {{Dispatcher}} about the termination of the job. The {{Dispatcher}}
then removes the job from the {{SubmittedJobGraphStore}}. If the {{Dispatcher}} process fails
before doing that, it might get restarted. In this case, the restarted {{Dispatcher}} would still find
the job in the {{SubmittedJobGraphStore}} and recover it. Since the {{CompletedCheckpointStore}}
is empty, it would start executing this job from the beginning.
> I think we must not remove job state before the job has been marked as done or made
inaccessible to any restarted processes. Concretely, we should first remove the job from
the {{SubmittedJobGraphStore}} and only then delete the checkpoints. Ideally, all job-related
cleanup operations happen atomically.
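
The ordering argument above can be sketched with a toy model. This is not Flink code: {{CleanupOrderSketch}}, {{Stores}}, {{recover}} and {{crashAfterFirstCleanupStep}} are hypothetical names, and the two in-memory collections merely stand in for the {{SubmittedJobGraphStore}} and the {{CompletedCheckpointStore}}. The sketch assumes the process fails after the first of the two cleanup steps and then shows what a restarted process would find.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the cleanup ordering; every name here is hypothetical, not a Flink API. */
public class CleanupOrderSketch {

    enum Outcome { NOT_RECOVERED, RESUMED_FROM_CHECKPOINT, RESTARTED_FROM_BEGINNING }

    /** In-memory stand-ins for the job graph store and the completed checkpoint store. */
    static final class Stores {
        final Set<String> jobGraphs = new HashSet<>();
        final Map<String, Integer> latestCheckpoint = new HashMap<>();
    }

    /** Models what a restarted process does with the persisted state it finds. */
    static Outcome recover(Stores stores, String jobId) {
        if (!stores.jobGraphs.contains(jobId)) {
            // No job graph: the job is invisible to the restarted process.
            return Outcome.NOT_RECOVERED;
        }
        return stores.latestCheckpoint.containsKey(jobId)
                ? Outcome.RESUMED_FROM_CHECKPOINT
                : Outcome.RESTARTED_FROM_BEGINNING;
    }

    /** Performs only the first cleanup step, then simulates a failure and a recovery. */
    static Outcome crashAfterFirstCleanupStep(boolean removeJobGraphFirst) {
        Stores stores = new Stores();
        String jobId = "job-1";
        stores.jobGraphs.add(jobId);
        stores.latestCheckpoint.put(jobId, 42);

        if (removeJobGraphFirst) {
            stores.jobGraphs.remove(jobId);        // proposed order: hide the job first
        } else {
            stores.latestCheckpoint.remove(jobId); // reported order: checkpoints go first
        }
        // The simulated process failure happens here, before the second cleanup step.
        return recover(stores, jobId);
    }

    public static void main(String[] args) {
        System.out.println("checkpoints first: " + crashAfterFirstCleanupStep(false));
        System.out.println("job graph first:   " + crashAfterFirstCleanupStep(true));
    }
}
```

In this model, removing the checkpoints first leaves a recoverable job with no checkpoint, so it restarts from the beginning. Removing the job graph first leaves only orphaned checkpoints behind, which still need cleaning up but cannot trigger a re-execution.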

This message was sent by Atlassian Jira
