flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-4449) Heartbeat Manager between ResourceManager and TaskExecutor
Date Wed, 24 Aug 2016 13:36:21 GMT

    [ https://issues.apache.org/jira/browse/FLINK-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434945#comment-15434945 ]

ASF GitHub Bot commented on FLINK-4449:

Github user tillrohrmann commented on the issue:

    Thanks for the contribution @beyond1920. The implementation of the `HeartbeatScheduler`
goes in the right direction so that it is reusable :-) The testing is also better. However,
we're still mixing different things in this PR (e.g. parts of the slot requesting logic).
    I think we can further generalize the heartbeating since the heartbeat manager is another
component which should be reusable across components (e.g. for the JobManager to heartbeat
the TMs). Furthermore, the receiving end of the heartbeating is not properly defined. 
    I think it would be best if we first properly define what this should look like. For example,
I'm not sure whether the exponential backoff strategy is the right way to go, since it can
happen that you wait twice as long as you've defined until you're notified about a heartbeat
failure. Another question is whether every heartbeat connection should be responsible for
triggering itself or whether the heartbeat manager should be responsible for that. Then we
have to define the receiving end. Is the heartbeat receiving end an independent `RpcEndpoint`?
How does the payload delivery work? Does the sender side ask for the result (future) or
does the receiving side answer via a tell message to the heartbeat manager?
    I've created an issue where we should continue the discussion https://issues.apache.org/jira/browse/FLINK-4478.
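
    To make the concern about exponential backoff concrete, here is a minimal sketch of the
arithmetic. The PR's actual retry scheme is not shown in this message, so the scheme below is
an assumption: each unanswered heartbeat is followed by a wait that doubles, and the target is
reported as failed once the next wait would exceed the configured maximum. The class and
method names (`BackoffDelay`, `worstCaseDetection`) are hypothetical and not part of Flink.

```java
import java.time.Duration;

// Hypothetical helper, not part of Flink: illustrates detection delay only.
final class BackoffDelay {

    private BackoffDelay() {}

    /**
     * Total time spent waiting before a failure is reported, ASSUMING waits of
     * initial, 2*initial, 4*initial, ... are performed while they do not exceed max.
     */
    static Duration worstCaseDetection(Duration initial, Duration max) {
        if (initial.isZero() || initial.isNegative()) {
            throw new IllegalArgumentException("initial wait must be positive");
        }
        Duration total = Duration.ZERO;
        Duration wait = initial;
        while (wait.compareTo(max) <= 0) {
            total = total.plus(wait);     // wait for this round's response
            wait = wait.multipliedBy(2);  // double the wait for the next round
        }
        return total;
    }

    /** With a single fixed timeout, the failure is reported after exactly that timeout. */
    static Duration fixedTimeoutDetection(Duration timeout) {
        return timeout;
    }
}
```

    Under these assumptions, an initial wait of 1 s and a maximum of 8 s gives waits of
1 + 2 + 4 + 8 = 15 s, which is close to twice the configured maximum, whereas a fixed
timeout of 8 s reports the failure after 8 s.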

> Heartbeat Manager between ResourceManager and TaskExecutor
> ----------------------------------------------------------
>                 Key: FLINK-4449
>                 URL: https://issues.apache.org/jira/browse/FLINK-4449
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Cluster Management
>            Reporter: zhangjing
>            Assignee: zhangjing
> HeartbeatManager is responsible for heartbeats between the ResourceManager and the TaskExecutors.
> 1. Register TaskExecutors
> Register heartbeat targets. If the heartbeat response for a target is not reported
> in time, mark the target as failed and notify the ResourceManager.
> 2. Trigger heartbeats
> Trigger heartbeats from the ResourceManager to the TaskExecutors periodically.
> The TaskExecutor reports its slot allocation in the heartbeat response.
> The ResourceManager syncs its own slot allocation with the heartbeat response.
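
The registration and timeout behaviour described in the issue could be sketched roughly as
follows. This is not the code from the PR and not Flink's API: `SimpleHeartbeatMonitor` and
all of its methods are hypothetical names, targets are identified by plain strings, and the
caller passes in the current time so the example stays deterministic. The slot allocation
payload carried in the heartbeat response is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch, not Flink code: tracks targets and reports timed-out ones.
final class SimpleHeartbeatMonitor {

    private final long timeoutMillis;
    private final Consumer<String> failureListener; // e.g. a callback into the ResourceManager
    private final Map<String, Long> lastSeen = new HashMap<>();

    SimpleHeartbeatMonitor(long timeoutMillis, Consumer<String> failureListener) {
        this.timeoutMillis = timeoutMillis;
        this.failureListener = failureListener;
    }

    /** Step 1: register a heartbeat target; registration counts as the first sign of life. */
    void register(String targetId, long nowMillis) {
        lastSeen.put(targetId, nowMillis);
    }

    /** Record a heartbeat response; responses from unknown targets are ignored. */
    void onHeartbeatResponse(String targetId, long nowMillis) {
        lastSeen.computeIfPresent(targetId, (id, previous) -> nowMillis);
    }

    /** Mark every target whose last response is older than the timeout as failed. */
    List<String> checkTimeouts(long nowMillis) {
        List<String> failed = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastSeen.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > timeoutMillis) {
                failed.add(entry.getKey());
                it.remove(); // a failed target is no longer monitored
            }
        }
        failed.forEach(failureListener); // notify after the iteration has finished
        return failed;
    }
}
```

In this sketch the periodic trigger (step 2) would be whatever scheduler calls
`checkTimeouts` and sends the heartbeat requests. Whether each connection triggers itself
or a central manager does so is one of the questions raised in the comment above.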

This message was sent by Atlassian JIRA
