flink-issues mailing list archives

From "vinoyang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-8871) Checkpoint cancellation is not propagated to stop checkpointing threads on the task manager
Date Tue, 12 Mar 2019 01:35:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790118#comment-16790118 ]

vinoyang commented on FLINK-8871:

[~yunta] Your solution sounds good, but it depends on other work being completed first.
For more details, see the discussion under FLINK-10966.

> Checkpoint cancellation is not propagated to stop checkpointing threads on the task manager
> -------------------------------------------------------------------------------------------
>                 Key: FLINK-8871
>                 URL: https://issues.apache.org/jira/browse/FLINK-8871
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0, 1.7.0
>            Reporter: Stefan Richter
>            Priority: Critical
> Flink currently lacks any feedback mechanism from the job manager / checkpoint
coordinator to the tasks for failing a checkpoint. As a result, snapshots running
on the tasks are not stopped even if their owning checkpoint has already been cancelled. Two
examples of cases where this applies are checkpoint timeouts and local checkpoint failures
on a task combined with a configuration that does not fail tasks on checkpoint failure. Note
that those running snapshots no longer count toward the maximum number of parallel checkpoints,
because their owning checkpoint is considered cancelled.
> Not stopping the task's snapshot thread can lead to a problematic situation where the
next checkpoint has already started while the abandoned thread from a previous checkpoint
is still running. This scenario can cascade: many parallel checkpoints
slow down checkpointing and make timeouts even more likely.
> A possible solution is introducing a {{cancelCheckpoint}} method as a counterpart to
the {{triggerCheckpoint}} method in the task manager gateway, which the checkpoint
coordinator invokes as part of cancelling the checkpoint.
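
The proposed {{cancelCheckpoint}} counterpart could look roughly like the following on the task side. This is only a minimal sketch under assumptions, not Flink's actual gateway or task code: the class names SnapshotRegistry and CancelCheckpointSketch, the method signatures, and the use of a plain thread pool with interruption are all hypothetical and chosen purely for illustration. It shows the idea of tracking running snapshot work by checkpoint ID so that a later cancel call can stop it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: none of these names are Flink's actual classes or API.
class SnapshotRegistry {
    private final ExecutorService snapshotThreads = Executors.newCachedThreadPool();
    private final Map<Long, FutureTask<Void>> running = new ConcurrentHashMap<>();

    // Starts the asynchronous snapshot work and remembers it by checkpoint ID.
    void triggerCheckpoint(long checkpointId, Runnable snapshotWork) {
        FutureTask<Void> task = new FutureTask<>(() -> {
            try {
                snapshotWork.run();
            } finally {
                running.remove(checkpointId); // finished work needs no cancellation
            }
        }, null);
        running.put(checkpointId, task);
        snapshotThreads.execute(task);
    }

    // Assumed counterpart to triggerCheckpoint: stops a still-running snapshot.
    // Returns false if no snapshot is registered for that checkpoint ID.
    boolean cancelCheckpoint(long checkpointId) {
        FutureTask<Void> task = running.remove(checkpointId);
        return task != null && task.cancel(true); // true = interrupt the thread
    }

    int runningSnapshots() {
        return running.size();
    }

    void shutdown() {
        snapshotThreads.shutdownNow();
    }
}

class CancelCheckpointSketch {
    // Returns true if cancelling interrupted the simulated snapshot thread.
    static boolean demo() {
        SnapshotRegistry registry = new SnapshotRegistry();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch interrupted = new CountDownLatch(1);
        try {
            registry.triggerCheckpoint(42L, () -> {
                started.countDown();
                try {
                    Thread.sleep(60_000); // stands in for a slow state upload
                } catch (InterruptedException e) {
                    interrupted.countDown(); // snapshot thread was stopped
                }
            });
            started.await(); // make sure the snapshot is actually running
            boolean cancelled = registry.cancelCheckpoint(42L);
            return cancelled
                    && interrupted.await(5, TimeUnit.SECONDS)
                    && registry.runningSnapshots() == 0;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            registry.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("snapshot interrupted: " + demo());
    }
}
```

In this sketch the coordinator side would call cancelCheckpoint for every task of a timed-out or failed checkpoint. Interruption is only one possible way to stop the work; a real implementation might instead close the underlying streams or use a cooperative cancellation flag.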

