flink-issues mailing list archives

From "Robert Metzger (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-6625) Flink removes HA job data when reaching JobStatus.FAILED
Date Thu, 11 Oct 2018 11:55:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646333#comment-16646333 ]

Robert Metzger commented on FLINK-6625:
---------------------------------------

Can we guarantee that checkpoints are consistent when a job has finished with FAILED?

Can't a failed checkpoint (unable to commit something, write somewhere, etc.) fail the job?
Such an incomplete checkpoint would make HA-recovery impossible.

> Flink removes HA job data when reaching JobStatus.FAILED
> --------------------------------------------------------
>
>                 Key: FLINK-6625
>                 URL: https://issues.apache.org/jira/browse/FLINK-6625
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Coordination
>    Affects Versions: 1.3.0, 1.4.0
>            Reporter: Till Rohrmann
>            Priority: Major
>
> Currently, Flink removes all job-related data (the submitted {{JobGraph}} as well as checkpoints)
> when it reaches a globally terminal state (including {{JobStatus.FAILED}}). In high availability
> mode, this entails that all data is removed from ZooKeeper and there is no way to recover
> the job by restarting the cluster with the same cluster id.
> I think this is problematic, since an application might have failed only because it has
> exhausted its number of restart attempts. Also, the last checkpoint information could be helpful
> when trying to find out why the job actually failed. I propose that we only remove job
> data when reaching the state {{JobStatus.SUCCESS}} or {{JobStatus.CANCELED}}.
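
The difference between the current and the proposed cleanup rule in the issue description above can be sketched as follows. This is only an illustration under stated assumptions: the class name, the method names, and the enum are hypothetical stand-ins that reuse the state names from the issue text, and none of this is Flink's actual implementation or API.

```java
import java.util.EnumSet;
import java.util.Set;

public class HaCleanupSketch {

    // Hypothetical stand-in that reuses the state names from the issue text.
    // This is NOT Flink's actual JobStatus class.
    enum JobStatus { RUNNING, FAILED, CANCELED, SUCCESS }

    // Current behavior as described in the issue: every globally terminal
    // state, including FAILED, triggers removal of the HA job data.
    static final Set<JobStatus> REMOVE_CURRENT =
            EnumSet.of(JobStatus.FAILED, JobStatus.CANCELED, JobStatus.SUCCESS);

    // Proposed behavior: only SUCCESS and CANCELED trigger removal, so the
    // data of a FAILED job stays available for recovery and debugging.
    static final Set<JobStatus> REMOVE_PROPOSED =
            EnumSet.of(JobStatus.CANCELED, JobStatus.SUCCESS);

    static boolean removesJobDataCurrently(JobStatus status) {
        return REMOVE_CURRENT.contains(status);
    }

    static boolean removesJobDataProposed(JobStatus status) {
        return REMOVE_PROPOSED.contains(status);
    }

    public static void main(String[] args) {
        // Print both decisions for every state in the sketch.
        for (JobStatus status : JobStatus.values()) {
            System.out.println(status
                    + " current=" + removesJobDataCurrently(status)
                    + " proposed=" + removesJobDataProposed(status));
        }
    }
}
```

In this sketch the two rules differ only for FAILED. The comment above asks whether the retained checkpoints of such a job can be trusted, which the sketch does not address.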



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
