flink-issues mailing list archives

From "Piotr Nowojski (Jira)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-22088) CheckpointCoordinator might not be able to abort triggering checkpoint if failover happens during triggering
Date Mon, 03 May 2021 05:47:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-22088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338185#comment-17338185 ]

Piotr Nowojski commented on FLINK-22088:
----------------------------------------

[~gaoyunhaii], as I understand it, the impact of this issue is not very severe? An extra
checkpoint will be triggered, and it will be declined by the task managers?

> CheckpointCoordinator might not be able to abort triggering checkpoint if failover happens during triggering
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-22088
>                 URL: https://issues.apache.org/jira/browse/FLINK-22088
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.12.2, 1.13.0
>            Reporter: Yun Gao
>            Priority: Minor
>
> Currently, when a job fails over, it tries to cancel all pending checkpoints via
> CheckpointCoordinatorDeActivator#jobStatusChanges -> stopCheckpointScheduler, which also
> sets periodicScheduling to false.
> If a checkpoint has just started triggering at this point, it might have acquired all the
> executions to trigger before the failover and then go on to trigger them. Ideally it should
> be aborted in createPendingCheckpoint -> preCheckGlobalState. However, since the check and
> the creation of the pending checkpoint happen in two different lock scopes,
> CheckpointCoordinator#stopCheckpointScheduler might run between the two lock scopes.
> We may optimize this check. However, since the executions would eventually fail to trigger
> the checkpoint, this should not affect the correctness of the job. Besides, even if we
> optimize it, there might still be cases where triggering an execution fails due to a
> concurrent failover.
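
The interleaving described above can be replayed with a small toy model. This is only a sketch and not Flink's actual CheckpointCoordinator code: ToyCoordinator, preCheck, createPending, stop, and replay are hypothetical names invented for illustration, and the optional re-check inside the second lock scope is an assumed form of the "optimize this check" idea, not a confirmed fix.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the check-then-act race; NOT Flink's real coordinator code. */
class CheckpointRaceSketch {

    /** Hypothetical stand-in for the coordinator; all names are illustrative. */
    static class ToyCoordinator {
        private final Object lock = new Object();
        private boolean periodicScheduling = true;
        private final List<Long> pendingCheckpoints = new ArrayList<>();

        /** First lock scope: the global pre-check before triggering. */
        boolean preCheck() {
            synchronized (lock) {
                return periodicScheduling;
            }
        }

        /** Second lock scope: registers the pending checkpoint. */
        boolean createPending(long checkpointId, boolean recheck) {
            synchronized (lock) {
                // Assumed optimization: repeat the check under the same lock.
                if (recheck && !periodicScheduling) {
                    return false; // scheduler was stopped in between, so abort
                }
                pendingCheckpoints.add(checkpointId);
                return true;
            }
        }

        /** Models stopCheckpointScheduler: cancel pending, stop scheduling. */
        void stop() {
            synchronized (lock) {
                pendingCheckpoints.clear();
                periodicScheduling = false;
            }
        }

        int pendingCount() {
            synchronized (lock) {
                return pendingCheckpoints.size();
            }
        }
    }

    /**
     * Replays the interleaving deterministically on one thread:
     * pre-check, then failover, then creation of the pending checkpoint.
     * Returns how many pending checkpoints survive the failover.
     */
    static int replay(boolean recheck) {
        ToyCoordinator coordinator = new ToyCoordinator();
        if (!coordinator.preCheck()) {
            return 0;
        }
        coordinator.stop(); // failover lands between the two lock scopes
        coordinator.createPending(1L, recheck);
        return coordinator.pendingCount();
    }

    public static void main(String[] args) {
        System.out.println("without re-check: " + replay(false)); // prints 1
        System.out.println("with re-check:    " + replay(true));  // prints 0
    }
}
```

Without the re-check, one checkpoint remains pending after the scheduler was stopped, which corresponds to the extra checkpoint that the executions later fail to trigger. As the description notes, a re-check narrows the window but cannot rule out a trigger failing because of a concurrent failover.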



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
