flink-issues mailing list archives

From "Maximilian Michels (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-2967) TM address detection might not always detect the right interface on slow networks / overloaded JMs
Date Thu, 05 Nov 2015 13:42:27 GMT

    [ https://issues.apache.org/jira/browse/FLINK-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991668#comment-14991668 ]

Maximilian Michels commented on FLINK-2967:
-------------------------------------------

I think the issue is that a task manager runs on the same host as the job manager and uses
its local address to connect to the job manager. The actor system is then bound to the local
device, and remote connections from other task managers fail.

I like the idea of trying {{InetAddress.getLocalHost()}} again after the fallback mechanism
has completed. This may also instantiate two actor systems, but I don't think that is a big
deal during the startup phase if we clean up afterwards.
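
To make the retry idea concrete, here is a minimal sketch under stated assumptions. It is not
Flink's actual detection code: it uses plain TCP sockets rather than an actor system, and the
names {{AddressDetectionSketch}}, {{canConnectFrom}}, {{detect}} and both timeout parameters
are hypothetical. It tries the preferred address (for example the result of
{{InetAddress.getLocalHost()}}) with a short timeout, walks the fallback candidates, and then
retries the preferred address with a longer timeout before settling for the fallback.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

public class AddressDetectionSketch {

    /** Returns true if a TCP connection bound to localAddr reaches target within timeoutMs. */
    static boolean canConnectFrom(InetAddress localAddr, InetSocketAddress target, int timeoutMs) {
        try (Socket socket = new Socket()) {
            // Bind to the candidate interface; port 0 lets the OS pick a free local port.
            socket.bind(new InetSocketAddress(localAddr, 0));
            socket.connect(target, timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers timeouts, refused connections, and addresses we cannot bind to.
            return false;
        }
    }

    /** Returns the first candidate that can reach the JobManager, or null if none can. */
    static InetAddress firstReachable(List<InetAddress> candidates, InetSocketAddress jobManager,
                                      int timeoutMs) {
        for (InetAddress candidate : candidates) {
            if (canConnectFrom(candidate, jobManager, timeoutMs)) {
                return candidate;
            }
        }
        return null;
    }

    /**
     * Hypothetical strategy: preferred address first, then the fallbacks, then the preferred
     * address once more with a longer timeout. The fallback is only used if the retry fails.
     */
    static InetAddress detect(InetAddress preferred, List<InetAddress> fallbacks,
                              InetSocketAddress jobManager, int firstTimeoutMs, int retryTimeoutMs) {
        if (canConnectFrom(preferred, jobManager, firstTimeoutMs)) {
            return preferred;
        }
        InetAddress fallback = firstReachable(fallbacks, jobManager, firstTimeoutMs);
        if (canConnectFrom(preferred, jobManager, retryTimeoutMs)) {
            return preferred;
        }
        return fallback; // may be null if nothing could connect
    }
}
```

In this sketch a slow job manager that misses the first short timeout still gets a second
chance, so the fallback address only wins if the preferred one keeps failing.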

> TM address detection might not always detect the right interface on slow networks / overloaded JMs
> --------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-2967
>                 URL: https://issues.apache.org/jira/browse/FLINK-2967
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 0.9, 0.10, 1.0
>            Reporter: Robert Metzger
>            Assignee: Robert Metzger
>
> I'm talking to a user who is facing the following issue:
> Some of the TaskManagers select the wrong IP address out of the available network interfaces.
> The first address we try is the one returned by {{InetAddress.getLocalHost()}}.
> This address is the right IP address to use, but the JobManager is not able to respond to
> that connection request within the timeout (50ms).
> So the TM tries the next address, which is not publicly reachable. However, the TM can
> connect to the JM from there. Netty will later fail to connect to the TM from the other TMs.
> There are two solutions for this issue:
> - Allow users to configure a higher timeout for the first address detection strategy.
> In most cases, the address returned by {{InetAddress.getLocalHost()}} is correct. By setting
> a high timeout, users with slow networks / overloaded JMs can make sure the TM picks this
> address.
> - Add an Akka message which we send from the TM to the JM, and the JM tries to connect
> to the TM. If that succeeds, we know that the TM is reachable from the outside.
> The problem is that we have to start a separate actor system on the TaskManager first.
> We have to do this because we might use a wrong IP address for the TM (so we might end up
> starting actor systems until we find an externally reachable IP).
> I'm first going to implement the first approach. If that solution works well for my user,
> I'll contribute this to 0.10 / 1.0.
> If not, I'll implement the second approach.
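
The second approach in the quoted description (the JM connecting back to the TM) could be
sketched roughly as follows. This is only an illustration under assumptions: it replaces the
Akka message and actor system with a plain TCP listener, and the names
{{ReachabilityProbeSketch}}, {{listenOn}} and {{probe}} are hypothetical, not Flink APIs.
The TM side listens on a candidate address, and the JM side tries to connect to the address
the TM advertised.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class ReachabilityProbeSketch {

    /** "TM" side: open a listener on the candidate address; port 0 picks a free port. */
    static ServerSocket listenOn(InetAddress candidate) throws IOException {
        return new ServerSocket(0, 1, candidate);
    }

    /** "JM" side: returns true if the advertised address accepts a connection in time. */
    static boolean probe(InetSocketAddress advertised, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(advertised, timeoutMs);
            return true;
        } catch (IOException e) {
            // Timeout, refused connection, or unroutable address: treat as unreachable.
            return false;
        }
    }
}
```

If the probe fails, the TM would close the listener and repeat with the next candidate, which
mirrors the cost noted in the description: one listener (there, one actor system) per
candidate until an externally reachable address is found.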



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
