flink-issues mailing list archives

From "Zhenqiu Huang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-12342) Yarn Resource Manager Acquires Too Many Containers
Date Thu, 02 May 2019 06:15:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-12342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831427#comment-16831427 ]

Zhenqiu Huang commented on FLINK-12342:
---------------------------------------

After setting the config to 3000 milliseconds, the job with 256 containers can be launched
successfully with only 1000+ total requested containers. The number can be reduced further by
using a larger value, such as 5000 or even higher. So, for small jobs with around 32 containers,
users should just keep the default value so that requests are sent out as soon as possible. For
large jobs, users need to tune the parameter to trade off fast requests against the negative
impact of repeatedly asking for more containers.
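
For reference, a minimal sketch of the tuning described above, assuming the option in question
is the container-request heartbeat interval (yarn.heartbeat.container-request-interval, 500 ms
by default); the exact key name is an assumption, since the comment does not name it. In
flink-conf.yaml this would look like:

    # Assumed option name; the comment above does not state it explicitly.
    # Small jobs (~32 containers): keep the 500 ms default so requests go out quickly.
    # yarn.heartbeat.container-request-interval: 500

    # Large jobs (e.g. 256 containers): lengthen the interval so fewer duplicate
    # container requests pile up in the AMRMClient before allocations arrive.
    yarn.heartbeat.container-request-interval: 3000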

> Yarn Resource Manager Acquires Too Many Containers
> --------------------------------------------------
>
>                 Key: FLINK-12342
>                 URL: https://issues.apache.org/jira/browse/FLINK-12342
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.6.4, 1.7.2, 1.8.0
>            Environment: We run jobs on Flink release 1.6.3.
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: Screen Shot 2019-04-29 at 12.06.23 AM.png, container.log, flink-1.4.png,
flink-1.6.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In the current implementation of YarnFlinkResourceManager, new containers are acquired one by
> one as requests arrive from the SlotManager. This mechanism works when the job is small, say
> fewer than 32 containers. If the job has 256 containers, containers cannot be allocated
> immediately and the pending requests in the AMRMClient are not removed accordingly. We observed
> that on each new slot request the AMRMClient asks for the current number of pending requests + 1
> containers. As a result, during the startup of such a job it asked for 4000+ containers. If an
> external dependency issue occurs, for example slow HDFS access, the whole job is blocked without
> getting enough resources and is finally killed with a SlotManager request timeout.
> Thus, we should use the total number of containers already asked for, rather than the pending
> requests in the AMRMClient, as the threshold for deciding whether to add one more resource
> request.
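
A rough Java sketch of the bookkeeping the last quoted paragraph proposes; this is illustrative
only, not the actual patch, and the class and member names are hypothetical:

    // Illustrative sketch, not the actual Flink patch; names are hypothetical.
    // Idea: track how many containers we have already requested ourselves instead of
    // deriving the next request count from the AMRMClient's pending-request list.
    class ContainerRequestTracker {
        private int requestedContainers;  // requests sent to YARN but not yet allocated
        private int requiredContainers;   // containers the SlotManager currently needs

        synchronized void onSlotRequest() {
            requiredContainers++;
            // Only issue a new container request when our own count says we are short,
            // instead of asking for pendingRequests + 1 on every slot request.
            if (requestedContainers < requiredContainers) {
                requestedContainers++;
                // amrmClient.addContainerRequest(...);  // exactly one new request
            }
        }

        synchronized void onContainerAllocated() {
            requestedContainers = Math.max(0, requestedContainers - 1);
            requiredContainers = Math.max(0, requiredContainers - 1);
        }
    }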



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
