flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gyula Fora (JIRA)" <j...@apache.org>
Subject [jira] [Created] (FLINK-12144) Option to prefer checkpoints on recovery
Date Tue, 09 Apr 2019 09:44:00 GMT
Gyula Fora created FLINK-12144:
----------------------------------

             Summary: Option to prefer checkpoints on recovery
                 Key: FLINK-12144
                 URL: https://issues.apache.org/jira/browse/FLINK-12144
             Project: Flink
          Issue Type: New Feature
          Components: Runtime / Checkpointing
            Reporter: Gyula Fora


When a streaming job fails the getLatestCheckpoint() of the CheckpointStore is used to determine
which checkpoint or savepoint is going to be used for recovery.

This behaviour is perfectly fine for jobs with relatively small states or where there are
no strong SLAs but it some cases it can be problematic.

For jobs with a very large state size, the difference between recovery times from savepoints
and checkpoints can be substantial to the point where it might break a use-case. So we would
like to avoid ever recovering from a savepoint if a not too old checkpoint is also readily
available.

This cannot be avoided right now if a job fails after we took a savepoint maybe for backup
purposes (maybe it is scheduled multiple times a day).

I suggest we add a configuration option to allow the job to fall back to an earlier checkpoint
(within maybe a certain age limit) even if there is a newer savepoint available to avoid lengthy
downtimes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message