flink-issues mailing list archives

From "Till Rohrmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-12342) Yarn Resource Manager Acquires Too Many Containers
Date Thu, 02 May 2019 09:24:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-12342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831500#comment-16831500 ]

Till Rohrmann commented on FLINK-12342:
---------------------------------------

Thanks for investigating this problem [~hpeter]. I think you are right that our aggressive
{{FAST_YARN_HEARTBEAT_INTERVAL}} combined with the YARN-1902 bug is the cause of the problem.
If YARN-1902 were properly resolved, this wouldn't be an issue. Until that is the case, one
way to mitigate the problem would be to make {{FAST_YARN_HEARTBEAT_INTERVAL}} configurable,
as you've suggested.
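
As an illustration only, a minimal sketch of what such a configuration option could look
like, using a hypothetical option key and default value (the actual name and wiring would
be decided in the implementation):

{code:java}
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;
import org.apache.flink.configuration.Configuration;

public class YarnHeartbeatOptions {

    // Hypothetical option: heartbeat interval (in ms) to use while container
    // requests are pending, replacing the hard-coded FAST_YARN_HEARTBEAT_INTERVAL.
    public static final ConfigOption<Integer> CONTAINER_REQUEST_HEARTBEAT_INTERVAL_MS =
            ConfigOptions.key("yarn.heartbeat.container-request-interval")
                    .defaultValue(500);

    // Resolve the interval from the user configuration, falling back to the default.
    public static int fastHeartbeatIntervalMs(Configuration flinkConfig) {
        return flinkConfig.getInteger(CONTAINER_REQUEST_HEARTBEAT_INTERVAL_MS);
    }
}
{code}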

> Yarn Resource Manager Acquires Too Many Containers
> --------------------------------------------------
>
>                 Key: FLINK-12342
>                 URL: https://issues.apache.org/jira/browse/FLINK-12342
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.6.4, 1.7.2, 1.8.0
>            Environment: We run jobs on Flink release 1.6.3.
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: Screen Shot 2019-04-29 at 12.06.23 AM.png, container.log, flink-1.4.png, flink-1.6.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In the current implementation of YarnFlinkResourceManager, new containers are acquired
> one by one as requests arrive from the SlotManager. This mechanism works while the job is
> still small, say fewer than 32 containers. If the job needs 256 containers, they cannot be
> allocated immediately and the pending requests in the AMRMClient are not removed accordingly.
> We observe that the AMRMClient then asks for the current number of pending requests + 1 (the
> new request from the SlotManager) containers. In this way, during the start-up of such a job,
> it asked for 4000+ containers. If an external dependency issue occurs, for example slow HDFS
> access, the whole job is blocked without getting enough resources and is finally killed with
> a SlotManager request timeout.
> Thus, we should use the total number of containers requested, rather than the pending
> requests in the AMRMClient, as the threshold for deciding whether one more resource request
> needs to be added.
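
For illustration, a rough sketch of the proposed bookkeeping; the class, field, and method
names below are hypothetical and not the actual YarnFlinkResourceManager code:

{code:java}
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

// Decide whether to add one more container request based on how many containers
// have been requested in total, instead of the pending requests reported by the
// AMRMClient, which can lag behind actual allocations (see YARN-1902).
class ContainerRequestTracker {

    private final AMRMClient<ContainerRequest> amrmClient;
    private int numPendingContainerRequests = 0; // asked for but not yet allocated
    private int numAllocatedContainers = 0;

    ContainerRequestTracker(AMRMClient<ContainerRequest> amrmClient) {
        this.amrmClient = amrmClient;
    }

    /** Called for every new container request coming from the SlotManager. */
    void requestContainerIfNeeded(int numRequiredContainers, ContainerRequest request) {
        // Threshold check against our own total, not the AMRMClient's pending requests.
        if (numPendingContainerRequests + numAllocatedContainers < numRequiredContainers) {
            amrmClient.addContainerRequest(request);
            numPendingContainerRequests++;
        }
    }

    /** Called when YARN allocates a container for a previously issued request. */
    void onContainerAllocated(Container container, ContainerRequest matchedRequest) {
        numPendingContainerRequests--;
        numAllocatedContainers++;
        // Remove the satisfied request so it is not re-sent on the next heartbeat.
        amrmClient.removeContainerRequest(matchedRequest);
    }
}
{code}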



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
