flink-issues mailing list archives

From "Zhu Zhu (Jira)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-13554) ResourceManager should have a timeout on starting new TaskExecutors.
Date Wed, 08 Jan 2020 09:45:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-13554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010522#comment-17010522 ]

Zhu Zhu commented on FLINK-13554:
---------------------------------

This issue is triggered only when a TM gets stuck while launching, before it registers with
the RM. Currently we see this case only in our stability tests, which intentionally break
ZooKeeper and network connections.
So I agree that we can postpone it as long as we do not encounter this issue in production.

> ResourceManager should have a timeout on starting new TaskExecutors.
> --------------------------------------------------------------------
>
>                 Key: FLINK-13554
>                 URL: https://issues.apache.org/jira/browse/FLINK-13554
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Xintong Song
>            Priority: Critical
>             Fix For: 1.10.0
>
>
> Recently, we encountered a case in which a TaskExecutor got stuck while launching on Yarn
> (without failing), so the job could not recover from continuous failovers.
> The TaskExecutor got stuck because of a problem in our environment. It got stuck somewhere
> after the ResourceManager started it, while the ResourceManager was waiting for it to be
> brought up and register. Later, when the slot request times out, the job fails over and
> requests slots from the ResourceManager again, but the ResourceManager still sees that a
> TaskExecutor (the stuck one) is being started and will not request a new container from Yarn.
> Therefore, the job cannot recover from the failure.
> I think that, to avoid such an unrecoverable state, the ResourceManager needs a timeout
> on starting new TaskExecutors. If starting a TaskExecutor takes too long, it should just
> fail the TaskExecutor and start a new one.
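
A minimal sketch of the timeout proposed in the description above, for illustration only. The
class PendingWorkerTracker, its methods, and the string worker IDs are hypothetical and are not
part of Flink's ResourceManager API. The sketch assumes the caller records when each
TaskExecutor start was requested, removes the entry on registration, and periodically asks
which starts have exceeded the timeout so they can be failed and replaced.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not Flink code: tracks TaskExecutors that were started but have not registered. */
final class PendingWorkerTracker {
    private final long startTimeoutMillis;
    // worker id -> time (in ms) at which the start was requested
    private final Map<String, Long> pendingSince = new HashMap<>();

    PendingWorkerTracker(long startTimeoutMillis) {
        this.startTimeoutMillis = startTimeoutMillis;
    }

    /** Call when a new container / TaskExecutor start is requested. */
    void onWorkerRequested(String workerId, long nowMillis) {
        pendingSince.put(workerId, nowMillis);
    }

    /** Call when the TaskExecutor registers; it is no longer pending. */
    void onWorkerRegistered(String workerId) {
        pendingSince.remove(workerId);
    }

    /**
     * Removes and returns every worker whose start has been pending for at
     * least the timeout. The caller would fail these and request replacements.
     */
    List<String> expire(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = pendingSince.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= startTimeoutMillis) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }

    /** Number of workers still counted as "being started". */
    int pendingCount() {
        return pendingSince.size();
    }
}
```

With such a check in place, a stuck TaskExecutor would stop counting as "being started" once
the timeout elapses, so a later slot request could trigger a new container request.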



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
