flink-issues mailing list archives

From "Till Rohrmann (Jira)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-14074) MesosResourceManager can't create new taskmanagers in Session Cluster Mode.
Date Thu, 24 Oct 2019 20:34:00 GMT

     [ https://issues.apache.org/jira/browse/FLINK-14074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann updated FLINK-14074:
----------------------------------
    Affects Version/s: 1.10.0

> MesosResourceManager can't create new taskmanagers in Session Cluster Mode.
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-14074
>                 URL: https://issues.apache.org/jira/browse/FLINK-14074
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Mesos
>    Affects Versions: 1.9.0, 1.10.0
>         Environment: Flink HA Session cluster 1.9.0 on mesos.
>            Reporter: Alexander Kasyanenko
>            Priority: Blocker
>             Fix For: 1.10.0, 1.9.2
>
>
> Hi, I'm trying to launch multiple jobs in a Flink session cluster deployed on Mesos.
> The Flink version is 1.9.0.
> The very first resource allocation completes successfully and the first submitted job launches,
but submitting any number of jobs afterwards doesn't affect the cluster in any way, and no
additional TaskManagers are allocated.
> From the logs I see that MesosResourceManager is requesting slots for the newly submitted
jobs: {{"o.a.f.m.r.c.MesosResourceManager - Request slot with profile ResourceProfile..."}},
but the line {{"Starting a new worker."}} appears in the log only as many times as the number
of TaskManagers allocated for the first job.
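One way to make the comparison above concrete is to count both log markers directly. The following is only a minimal sketch: the class name `LogMarkerCount` and the log file path passed as the first argument are assumptions for illustration, and the two marker strings are the ones quoted from the log above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Hypothetical helper (not part of Flink): counts log lines containing a marker. */
final class LogMarkerCount {

    /** Returns how many lines contain the given marker substring. */
    static long count(List<String> lines, String marker) {
        return lines.stream().filter(line -> line.contains(marker)).count();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path to a JobManager log file.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        long slotRequests = count(lines, "Request slot with profile");
        long workerStarts = count(lines, "Starting a new worker.");
        System.out.println("slot requests: " + slotRequests);
        System.out.println("worker starts: " + workerStarts);
    }
}
```

If the number of worker starts stops growing while slot requests keep arriving, that would match the behaviour described in this report.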
> I'm a complete noob in Flink internals, but I took a wild guess at the reason. I think
that the problem is in this check: [https://github.com/apache/flink/blob/release-1.9.0/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosResourceManager.java#L436]
> It might be that the RM is lazily created by a factory on the first call, at which point the private
final field {{slotsPerWorker}} is set. If so, this check would prevent the creation of any new worker
once the iterator has traversed the entire collection. My main assumption is that {{slotsPerWorker}}
is never modified again.
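The hypothesis above can be illustrated with a small, self-contained sketch. This is not the actual MesosResourceManager code: the class `ExhaustibleWorkerGuard`, its methods, and the idea of a stored iterator are assumptions made only to show how a guard backed by a one-shot iterator over a never-refreshed collection would reject every request after the first batch.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;

/** Hypothetical stand-in for a resource manager; NOT the real Flink class. */
final class ExhaustibleWorkerGuard {
    // Assumption from the report: set once at construction and never modified again.
    private final Collection<Integer> slotsPerWorker;
    // Assumption: a single iterator over that collection is kept and consumed.
    private final Iterator<Integer> remaining;
    private int startedWorkers = 0;

    ExhaustibleWorkerGuard(Collection<Integer> slotsPerWorker) {
        this.slotsPerWorker = slotsPerWorker;
        this.remaining = slotsPerWorker.iterator();
    }

    /** Returns true if a worker was started, false if the guard rejected the request. */
    boolean startNewWorker() {
        if (!remaining.hasNext()) {
            // Iterator exhausted: this and every later request is dropped.
            return false;
        }
        remaining.next();
        startedWorkers++;
        return true;
    }

    int startedWorkers() {
        return startedWorkers;
    }

    public static void main(String[] args) {
        // Two entries stand in for the workers allocated for the first job.
        ExhaustibleWorkerGuard rm = new ExhaustibleWorkerGuard(Arrays.asList(1, 1));
        for (int request = 1; request <= 4; request++) {
            System.out.println("request " + request + " started=" + rm.startNewWorker());
        }
    }
}
```

In this sketch the number of started workers can never exceed the size of the initial collection, which would look like the symptom reported here. Whether the real check behaves this way is exactly what remains to be verified.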
>  
> I'm sorry that I didn't do much investigation before reporting, but I'll try to do
some after the weekend. I plan to build Flink without this check and see if it helps. I'll also
play around with the tests for this RM. Since it's my first time working with Flink internals,
I'll be back after a few days.
> Any help will be much appreciated.
> Thanks in advance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
