flink-issues mailing list archives

From "Zhu Zhu (Jira)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
Date Mon, 01 Jun 2020 12:59:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120997#comment-17120997 ]

Zhu Zhu commented on FLINK-17726:

I just thought of a case in which the proposed change might be problematic.
Consider a job {A1 -> B1} in which A1 and B1 were both running. Later A1 failed and B1
was CANCELED due to A1's failure.
However, the CANCELED state of B1 was reported earlier than the FAILED state of A1.
If we trigger a failover on receiving the directly CANCELED state of B1 and start canceling
A1, the failure cause of A1 will be discarded because it will not be treated as the root failure.
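
As a rough illustration, the ordering problem can be reproduced with a toy model. Everything here is hypothetical (class, method, and state names are stand-ins, not Flink's scheduler code), and it assumes that an update for a task the JM is already canceling only completes the cancellation and drops the reported cause:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the race; all names are hypothetical, not Flink classes.
final class RootCauseRace {
    enum State { RUNNING, CANCELING, CANCELED, FAILED }

    private final Map<String, State> states = new HashMap<>();
    private String rootCause; // stays null until a failover is triggered

    RootCauseRace() {
        states.put("A1", State.RUNNING);
        states.put("B1", State.RUNNING);
    }

    // Handles one task state update the way the JM would under the proposed change.
    void onUpdate(String task, State reported, String cause) {
        if (states.get(task) == State.CANCELING) {
            // Assumption: the JM is already canceling this task, so the update only
            // completes the cancellation and the reported cause is dropped.
            states.put(task, State.CANCELED);
            return;
        }
        states.put(task, reported);
        // Proposed change: a CANCELED update the JM did not ask for counts as a failure.
        if (rootCause == null && (reported == State.FAILED || reported == State.CANCELED)) {
            rootCause = cause;
            // The failover starts canceling every task that is still running.
            states.replaceAll((t, s) -> s == State.RUNNING ? State.CANCELING : s);
        }
    }

    // Replays the two updates in either order and returns the recorded root cause.
    static String rootCauseWhen(boolean canceledReportedFirst) {
        RootCauseRace jm = new RootCauseRace();
        if (canceledReportedFirst) {
            jm.onUpdate("B1", State.CANCELED, "B1 canceled by TaskManager");
            jm.onUpdate("A1", State.FAILED, "A1 original failure");
        } else {
            jm.onUpdate("A1", State.FAILED, "A1 original failure");
            jm.onUpdate("B1", State.CANCELED, "B1 canceled by TaskManager");
        }
        return jm.rootCause;
    }

    public static void main(String[] args) {
        System.out.println(rootCauseWhen(false)); // prints "A1 original failure"
        System.out.println(rootCauseWhen(true));  // prints "B1 canceled by TaskManager"
    }
}
```

When B1's CANCELED update arrives first, the recorded root cause is B1's cancellation and A1's real failure cause never surfaces.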

Maybe we should mark this kind of directly CANCELED task with a dedicated exception and
not trigger failover on it on the JM side.
[~trohrmann][~nicholasjiang] WDYT?
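
One possible reading of the dedicated-exception idea, as a minimal sketch: the CANCELED update carries a cause, and the JM skips failover when that cause is a marker type. The exception class and the helper method are hypothetical names introduced only for illustration; nothing like them is claimed to exist in Flink.

```java
// Sketch only; all names are hypothetical, not Flink classes.
final class DirectCancelMarker {

    /** Hypothetical marker for a task that the TM canceled because another task failed. */
    static final class CanceledByOtherTaskFailureException extends Exception {
        CanceledByOtherTaskFailureException(String message) {
            super(message);
        }
    }

    /**
     * JM-side decision for a task that reached CANCELED without the JM requesting it.
     * A cause of the marker type means another task's failure is the root cause, so
     * the JM waits for that FAILED update instead of triggering failover here.
     */
    static boolean shouldTriggerFailover(Throwable cancelCause) {
        return !(cancelCause instanceof CanceledByOtherTaskFailureException);
    }

    public static void main(String[] args) {
        Throwable marked = new CanceledByOtherTaskFailureException("upstream A1 failed");
        System.out.println(shouldTriggerFailover(marked));                     // false
        System.out.println(shouldTriggerFailover(new RuntimeException("io"))); // true
    }
}
```

With such a marker, B1's early CANCELED update would not start a failover, so the later FAILED update of A1 could still be recorded as the root failure.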

> Scheduler should take care of tasks directly canceled by TaskManager
> --------------------------------------------------------------------
>                 Key: FLINK-17726
>                 URL: https://issues.apache.org/jira/browse/FLINK-17726
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: Zhu Zhu
>            Assignee: Nicholas Jiang
>            Priority: Critical
>             Fix For: 1.11.0, 1.12.0
> JobManager will not trigger failure handling when receiving a CANCELED task update.
> This is because CANCELED tasks are usually caused by another FAILED task. These CANCELED
> tasks will be restarted by the failover process triggered by the FAILED task.
> However, if a task is directly CANCELED by TaskManager due to its own runtime issue,
> the task will not be recovered by JM and thus the job would hang.
> This is a potential issue and we should avoid it.
> A possible solution is to let JobManager treat tasks transitioning to CANCELED from all
> states except CANCELING as failed tasks.
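
The solution proposed in the description could be sketched as a single transition check. This is a minimal illustration under assumptions: the enum is a simplified stand-in for the task execution states, and the class and method names are hypothetical.

```java
// Sketch of the proposed rule; names are hypothetical, not Flink's scheduler API.
final class CanceledTransitionRule {

    // Simplified stand-in for the task execution states.
    enum State { CREATED, SCHEDULED, DEPLOYING, RUNNING, CANCELING, CANCELED, FAILED, FINISHED }

    /**
     * Returns true when a state update should enter failure handling:
     * either the task FAILED, or it became CANCELED from any state other than
     * CANCELING, meaning the JM never asked for the cancellation.
     */
    static boolean shouldHandleAsFailure(State previous, State updated) {
        if (updated == State.FAILED) {
            return true;
        }
        return updated == State.CANCELED && previous != State.CANCELING;
    }

    public static void main(String[] args) {
        System.out.println(shouldHandleAsFailure(State.RUNNING, State.CANCELED));   // true
        System.out.println(shouldHandleAsFailure(State.CANCELING, State.CANCELED)); // false
    }
}
```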
