flink-issues mailing list archives

From "vinoyang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-11159) Allow configuration whether to fall back to savepoints for restore
Date Tue, 16 Apr 2019 12:49:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-11159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818974#comment-16818974

vinoyang commented on FLINK-11159:

[~till.rohrmann] We just ran into a savepoint issue that is not about recovery. We configured
checkpoints to be retained and set the maximum number of retained checkpoints to 1. If we
cancel the job with a savepoint, no checkpoint is left in HDFS: the savepoint is stored in
the completed checkpoint store, so the finished savepoint causes the previously completed
checkpoint to be discarded. I think this behavior is unexpected unless the user is familiar
with the source code. In fact, we thought it was a bug until we read the source code.
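
The eviction described above can be sketched with a toy model. This is not Flink's
implementation: the class name ToyCompletedCheckpointStore and its methods are hypothetical,
and the sketch assumes only what the comment states, namely that savepoints and checkpoints
share one bounded store and that the oldest entry is evicted when the limit is exceeded.

```python
from collections import deque


class ToyCompletedCheckpointStore:
    """Hypothetical bounded store shared by checkpoints and savepoints."""

    def __init__(self, max_retained):
        self.max_retained = max_retained
        self.entries = deque()   # (name, is_savepoint), oldest first
        self.discarded = []      # names of checkpoints whose files were deleted

    def add(self, name, is_savepoint=False):
        self.entries.append((name, is_savepoint))
        # Evict the oldest entries once the store exceeds its limit.
        while len(self.entries) > self.max_retained:
            old_name, old_is_savepoint = self.entries.popleft()
            # Assumption of this toy model: only evicted checkpoints are deleted.
            if not old_is_savepoint:
                self.discarded.append(old_name)

    def retained_checkpoints(self):
        return [name for name, is_savepoint in self.entries if not is_savepoint]


store = ToyCompletedCheckpointStore(max_retained=1)
store.add("chk-1")                      # periodic checkpoint
store.add("sp-1", is_savepoint=True)    # cancel with savepoint
remaining = store.retained_checkpoints()
```

With a limit of 1, the savepoint takes the only slot, so `remaining` is empty and `chk-1`
ends up in `store.discarded`, which matches the observation that no retained checkpoint
is left after cancelling with a savepoint.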

What do you think? [~StephanEwen]

It seems that storing the savepoint in the completed checkpoint store causes many problems,
both for snapshotting and for recovery.

> Allow configuration whether to fall back to savepoints for restore
> ------------------------------------------------------------------
>                 Key: FLINK-11159
>                 URL: https://issues.apache.org/jira/browse/FLINK-11159
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.5.5, 1.6.2, 1.7.0
>            Reporter: Nico Kruber
>            Assignee: vinoyang
>            Priority: Major
> Ever since FLINK-3397, upon failure, Flink restarts from the latest checkpoint or savepoint,
whichever is more recent. With the introduction of local recovery and the knowledge that
a RocksDB checkpoint restore just copies the files, it may be time to reconsider this and
make it configurable:
> In certain situations, it may be faster to restore from the latest checkpoint only (even
if there is a more recent savepoint) and reprocess the data in between. On the downside, though,
that may not be correct: if the savepoint was the latest snapshot, restoring an older
checkpoint can break side effects, e.g. consider this chain: {{chk1 -> chk2 -> sp … restore chk2 -> …}}.
Then all side effects between {{chk2 -> sp}} would be reproduced.
> Making this configurable will allow users to pick whatever they need, or can afford, to get
the lowest recovery time in Flink.
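
The two restore policies discussed in the description can be illustrated with a small
sketch. The function pick_restore_point and its checkpoints_only flag are hypothetical
names chosen for illustration; they are not an existing Flink option or API, and the
snapshot history is the {{chk1 -> chk2 -> sp}} chain from the description.

```python
def pick_restore_point(history, checkpoints_only):
    """Return the name of the snapshot to restore from, or None if none qualifies.

    history: list of (name, kind) tuples, oldest first;
             kind is "checkpoint" or "savepoint".
    checkpoints_only: hypothetical switch; if True, savepoints are skipped.
    """
    candidates = [
        name for name, kind in history
        if not checkpoints_only or kind == "checkpoint"
    ]
    return candidates[-1] if candidates else None


history = [("chk1", "checkpoint"), ("chk2", "checkpoint"), ("sp", "savepoint")]

latest_any = pick_restore_point(history, checkpoints_only=False)
latest_checkpoint = pick_restore_point(history, checkpoints_only=True)
```

Here `latest_any` is `sp` (the behavior since FLINK-3397) and `latest_checkpoint` is `chk2`.
Choosing `chk2` means everything processed between `chk2` and `sp` is replayed, which is
where the duplicated side effects would come from.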

