flink-issues mailing list archives

From "Julius Michaelis (Jira)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-13808) Checkpoints expired by timeout may leak RocksDB files
Date Thu, 17 Oct 2019 01:18:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953299#comment-16953299 ]

Julius Michaelis commented on FLINK-13808:

As for something else that may be easier to implement than a cancellation message: why don't the TaskManagers also cancel checkpoints based on the timeout themselves? (Of course, that doesn't catch other cancellation reasons, but…)
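
To illustrate, here is a rough sketch of what I mean. It is purely hypothetical (none of these class or method names exist in Flink): when the asynchronous part of a snapshot starts on the TaskManager, arm a local timer with the same timeout the CheckpointCoordinator uses, and cancel the snapshot locally once it fires.

{code:java}
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only: let the TaskManager abort its own snapshot once the
// checkpoint timeout has elapsed, instead of relying on a message from the
// JobManager. None of these names exist in Flink.
public class LocalCheckpointTimeoutGuard {

    private final ScheduledExecutorService timerService;

    public LocalCheckpointTimeoutGuard(ScheduledExecutorService timerService) {
        this.timerService = timerService;
    }

    /**
     * Called when the asynchronous part of a snapshot is started for a checkpoint.
     * Schedules a local cancellation after the same timeout the coordinator uses,
     * so temporary snapshot files get released even if the coordinator's abort
     * never reaches this TaskManager.
     */
    public void armTimeout(long checkpointId, Future<?> asyncSnapshot, long timeoutMillis) {
        timerService.schedule(() -> {
            if (!asyncSnapshot.isDone()) {
                System.out.println("Locally cancelling expired checkpoint " + checkpointId);
                // Interrupt the snapshot thread; its cleanup path should then
                // delete the local RocksDB snapshot directory.
                asyncSnapshot.cancel(true);
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
{code}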

> Checkpoints expired by timeout may leak RocksDB files
> -----------------------------------------------------
>                 Key: FLINK-13808
>                 URL: https://issues.apache.org/jira/browse/FLINK-13808
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / State Backends
>    Affects Versions: 1.8.0, 1.8.1
>         Environment: So far only reliably reproducible on a 4-node cluster with parallelism
≥ 100. But do try https://github.com/jcaesar/flink-rocksdb-file-leak
>            Reporter: Julius Michaelis
>            Priority: Minor
> A RocksDB state backend with HDFS checkpoints, with or without local recovery, may leak files in {{io.tmp.dirs}} when a checkpoint expires by timeout.
> If a checkpoint is larger than what can be transferred within one checkpoint timeout, checkpoints will keep failing indefinitely. Combined with a quick rollover of SST files (e.g. due to a high write rate), this can quickly exhaust the available disk space (or memory, as /tmp is the default location).
> As a workaround, the JobManager's REST API can be queried frequently for failed checkpoints, and the associated files deleted accordingly (a sketch of this is included below).
> I've tried investigating the cause a little bit, but I'm stuck:
>  * {{Checkpoint 19 of job ac7efce3457d9d73b0a4f775a6ef46f8 expired before completing.}}
and similar gets printed, so
>  * [{{abortExpired}} is invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L547-L549],
>  * [{{dispose}} is invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L416],
>  * [{{cancelCaller}} is invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L488],
>  * [the canceler is invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L497]
([through one more layer|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L129]),
>  * [{{cleanup}} is invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L95] (possibly [not from {{cancel}}|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L84]),
>  * [{{cleanupProvidedResources}} is invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L162]
(this is the indirection that made me give up), so
>  * [this trace log|https://github.com/apache/flink/blob/release-1.8.1/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L372]
should be printed, but it isn't.
> I have some time to further investigate, but I'd appreciate help finding out where in this chain things go wrong (a simplified sketch of the cleanup hand-off also follows below).
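
For reference, a minimal sketch of the workaround mentioned in the description above, assuming a reachable JobManager REST endpoint ({{/jobs/<jobId>/checkpoints}} is the checkpoint statistics endpoint). It only lists failed checkpoint IDs; how those IDs map to leftover directories under {{io.tmp.dirs}} depends on the state backend and deployment, so the actual deletion is left out.

{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the workaround: poll the JobManager's REST API for failed/expired
// checkpoints so their leftover files can be removed by an external cron job.
// Usage: java FailedCheckpointPoller <jobId> [jobManagerUrl]
public class FailedCheckpointPoller {

    public static void main(String[] args) throws Exception {
        String jobId = args[0];
        String jobManager = args.length > 1 ? args[1] : "http://localhost:8081";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(jobManager + "/jobs/" + jobId + "/checkpoints"))
                .GET()
                .build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // Crude extraction of failed entries from the JSON checkpoint history;
        // a real script would use a JSON parser.
        Pattern failedEntry = Pattern.compile("\\{[^{}]*\"status\"\\s*:\\s*\"FAILED\"[^{}]*\\}");
        Pattern idField = Pattern.compile("\"id\"\\s*:\\s*(\\d+)");

        Matcher entries = failedEntry.matcher(body);
        while (entries.find()) {
            Matcher id = idField.matcher(entries.group());
            if (id.find()) {
                System.out.println("Failed checkpoint: " + id.group(1));
                // This is where the files this checkpoint left behind under
                // io.tmp.dirs would be deleted; the path layout is deployment-specific.
            }
        }
    }
}
{code}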
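
And for the last few steps of the chain, a simplified illustration of the cleanup hand-off I was trying to trace. This is not Flink's actual {{AsyncSnapshotCallable}}, only my reading of the general pattern: an ownership flag decides whether the snapshot thread or the cancelling thread ends up running {{cleanupProvidedResources()}}, which is where the missing trace log lives.

{code:java}
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified sketch, NOT Flink's code: the cancelling thread and the snapshot
// thread race for an ownership flag, and whichever wins is responsible for
// cleaning up the provided resources (the local RocksDB snapshot files).
abstract class AsyncSnapshotSketch<T> implements Callable<T> {

    // Whoever flips this first owns the cleanup of the snapshot resources.
    private final AtomicBoolean cleanupOwnershipTaken = new AtomicBoolean(false);

    @Override
    public T call() throws Exception {
        try {
            return callInternal();
        } finally {
            // The snapshot thread cleans up only if no canceller claimed it first.
            if (cleanupOwnershipTaken.compareAndSet(false, true)) {
                cleanupProvidedResources();
            }
        }
    }

    public void cancel() {
        // The cancelling thread cleans up only if the snapshot thread has not.
        if (cleanupOwnershipTaken.compareAndSet(false, true)) {
            cleanupProvidedResources();
        }
    }

    /** Runs the actual snapshot work. */
    protected abstract T callInternal() throws Exception;

    /** Deletes the local snapshot files; the trace log I never see would be here. */
    protected abstract void cleanupProvidedResources();
}
{code}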

