flink-issues mailing list archives

From "Stephan Ewen (Jira)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-5763) Make savepoints self-contained and relocatable
Date Thu, 16 Jan 2020 10:52:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016805#comment-17016805 ]

Stephan Ewen commented on FLINK-5763:

Let's reboot this issue. The original PRs are a bit outdated by now.

A rough idea of how to do this would be:
  * Initially support this only for savepoint/full checkpoints, where all state and metadata
is contained in the _exclusive checkpoint directory_.
  * Have a field in the savepoint metadata that marks the {{FileStateHandles}} as having
relative paths, relative to the {{CheckpointStorageLocation}} (the exclusive location part).
  * When writing the state handles in the savepoint metadata writer, drop the prefix equivalent
to the {{CheckpointStorageLocation}}'s exclusive path.
  * When reading the state handles in the savepoint metadata reader, prepend the {{CheckpointStorageLocation}}'s
exclusive path.
  * This should need modification only in the metadata readers / writers, no other part of
the code.
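
The write/read steps above can be sketched roughly as follows. This is only a minimal illustration of the prefix handling, not Flink code: the class {{RelativeStateHandlePaths}} and its methods are hypothetical names, paths are treated as plain strings, and the example directories are made up.

```java
// Hypothetical helper, not part of Flink: illustrates dropping and re-adding
// the exclusive checkpoint directory prefix for state handle paths.
final class RelativeStateHandlePaths {

    private RelativeStateHandlePaths() {}

    // Normalizes a directory so that prefix matching cannot stop mid-segment.
    private static String withTrailingSlash(String dir) {
        return dir.endsWith("/") ? dir : dir + "/";
    }

    // Writer side: drop the prefix equivalent to the exclusive directory.
    static String toRelative(String exclusiveDir, String absolutePath) {
        String prefix = withTrailingSlash(exclusiveDir);
        if (!absolutePath.startsWith(prefix)) {
            // State outside the exclusive directory cannot be made relative.
            throw new IllegalArgumentException(
                    absolutePath + " is not located under " + prefix);
        }
        return absolutePath.substring(prefix.length());
    }

    // Reader side: prepend the exclusive directory the metadata was read from.
    static String toAbsolute(String exclusiveDir, String relativePath) {
        return withTrailingSlash(exclusiveDir) + relativePath;
    }

    public static void main(String[] args) {
        // Assumed example locations, chosen only for illustration.
        String originalDir = "hdfs:///savepoints/savepoint-abc";
        String stored = toRelative(originalDir, originalDir + "/data-1");
        System.out.println(stored); // data-1

        // After the whole directory has been moved, reading resolves
        // against the new location.
        String movedDir = "s3://backup/savepoint-abc/";
        System.out.println(toAbsolute(movedDir, stored)); // s3://backup/savepoint-abc/data-1
    }
}
```

Because only the relative part is stored, the reader never sees the original location, which is what makes the directory relocatable. A handle that points outside the exclusive directory is rejected here; how such handles should actually be treated is left open in this sketch.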

> Make savepoints self-contained and relocatable
> ----------------------------------------------
>                 Key: FLINK-5763
>                 URL: https://issues.apache.org/jira/browse/FLINK-5763
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>            Reporter: Ufuk Celebi
>            Priority: Critical
>              Labels: usability
>             Fix For: 1.11.0
> After a user has triggered a savepoint, a single savepoint file will be returned as a
handle to the savepoint. A savepoint to {{<target>}} creates a savepoint file like {{<target>/savepoint-<randomSuffix>}}.
> This file contains the metadata of the corresponding checkpoint, but not the actual program
state. While this works well for short term management (pause-and-resume a job), it makes
it hard to manage savepoints over longer periods of time.
> h4. Problems
> h5. Scattered Checkpoint Files
> For file system based checkpoints (FsStateBackend, RocksDBStateBackend) this results
in the savepoint referencing files from the checkpoint directory (usually different than <target>).
For users, it is virtually impossible to tell which checkpoint files belong to a savepoint
and which are lingering around. This can easily lead to accidentally invalidating a savepoint
by deleting checkpoint files.
> h5. Savepoints Not Relocatable
> Even if a user is able to figure out which checkpoint files belong to a savepoint, moving
these files will invalidate the savepoint as well, because the metadata file references absolute
file paths.
> h5. Forced to Use CLI for Disposal
> Because of the scattered files, the user is in practice forced to use Flink’s CLI to
dispose of a savepoint. It should be possible to handle this within the user’s environment
via a file system delete operation.
> h4. Proposal
> In order to solve the described problems, savepoints should contain all their state,
both metadata and program state, inside a single directory. Furthermore the metadata must
only hold relative references to the checkpoint files. This makes it obvious which files make
up the state of a savepoint and it is possible to move savepoints around by moving the savepoint
directory.
> h5. Desired File Layout
> Triggering a savepoint to {{<target>}} creates a directory as follows:
> {code}
> <target>/savepoint-<jobId>-<randomSuffix>
>   +-- _metadata
>   +-- data-<randomSuffix> [1 or more]
> {code}
> We include the JobID in the savepoint directory name in order to give some hints about
which job a savepoint belongs to.
> h5. CLI
> - Trigger: When triggering a savepoint to {{<target>}} the savepoint directory
will be returned as the handle to the savepoint.
> - Restore: Users can restore by pointing to the directory or the _metadata file. The
data files should be required to be in the same directory as the _metadata file.
> - Dispose: The disposal command should be deprecated and eventually removed. While deprecated,
disposal can happen by specifying the directory or the _metadata file (same as restore).

This message was sent by Atlassian Jira
