flink-issues mailing list archives

From "Till Rohrmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout
Date Mon, 29 Jan 2018 16:11:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343559#comment-16343559 ]

Till Rohrmann commented on FLINK-6160:
--------------------------------------

The {{TaskExecutor}} retries a timed-out connection to the {{JobMaster}}, but the other components
do not yet retry their connections. We should fix this.

>  Retry JobManager/ResourceManager connection in case of timeout
> ---------------------------------------------------------------
>
>                 Key: FLINK-6160
>                 URL: https://issues.apache.org/jira/browse/FLINK-6160
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Distributed Coordination
>    Affects Versions: 1.3.0
>            Reporter: Till Rohrmann
>            Priority: Major
>              Labels: flip-6
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to the remote
> component. Furthermore, it assumes that the component has actually failed and, thus, it
> only tries to reconnect to the component once it is notified about a new leader address
> and leader session ID. This is brittle, because the heartbeat could also time out without
> the component having crashed. Thus, we should automatically retry the connection using the
> latest known leader address information in case of a timeout.
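
The proposed behaviour could be sketched roughly as follows. This is only a minimal
illustration and not Flink code: the class LeaderReconnectSketch, its methods, and the
BiPredicate used as a stand-in for "try to connect" are all hypothetical names. It assumes
a bounded number of immediate retries; a real implementation would presumably retry
asynchronously and with a delay between attempts.

```java
import java.util.UUID;
import java.util.function.BiPredicate;

/** Hypothetical sketch; none of these names are actual Flink APIs. */
public class LeaderReconnectSketch {

    // Stand-in for a connection attempt: returns true if connecting succeeded.
    private final BiPredicate<String, UUID> connector;
    private final int maxRetries;

    // Latest leader information this component has been notified about.
    private String lastLeaderAddress;
    private UUID lastLeaderSessionId;
    private boolean connected;

    LeaderReconnectSketch(BiPredicate<String, UUID> connector, int maxRetries) {
        this.connector = connector;
        this.maxRetries = maxRetries;
    }

    /** A new leader was announced: remember it and connect to it. */
    void notifyNewLeader(String address, UUID sessionId) {
        lastLeaderAddress = address;
        lastLeaderSessionId = sessionId;
        connected = connector.test(address, sessionId);
    }

    /**
     * Heartbeat timed out: close the connection, but do not wait for a new
     * leader notification. Retry the latest known leader instead, because
     * the remote component may still be alive.
     */
    boolean onHeartbeatTimeout() {
        connected = false;
        if (lastLeaderAddress == null) {
            return false; // nothing known to retry against
        }
        for (int attempt = 0; attempt < maxRetries && !connected; attempt++) {
            connected = connector.test(lastLeaderAddress, lastLeaderSessionId);
        }
        return connected;
    }

    boolean isConnected() {
        return connected;
    }

    public static void main(String[] args) {
        // Simulated remote: the initial connect succeeds, the next two
        // attempts fail, and the fourth call succeeds again.
        int[] calls = {0};
        LeaderReconnectSketch sketch = new LeaderReconnectSketch(
                (address, sessionId) -> {
                    int n = ++calls[0];
                    return n == 1 || n >= 4;
                },
                5);

        sketch.notifyNewLeader("jobmaster-host:6123", UUID.randomUUID());
        boolean reconnected = sketch.onHeartbeatTimeout();
        // prints "reconnected=true after 4 connect calls"
        System.out.println("reconnected=" + reconnected + " after " + calls[0] + " connect calls");
    }
}
```

If the retries are exhausted, the sketch simply stays disconnected until the next
notifyNewLeader call, which matches the current behaviour described above.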



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
