flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-9567) Flink does not release resource in Yarn Cluster mode
Date Mon, 02 Jul 2018 09:00:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529560#comment-16529560 ]

ASF GitHub Bot commented on FLINK-9567:

GitHub user Clarkkkkk opened a pull request:


    [FLINK-9567][runtime][yarn] Fix the bug that Flink does not release YarnContainer in some

    ## What is the purpose of the change
      - This pull request responds to  [JIRA issue FLINK-9567](https://issues.apache.org/jira/browse/FLINK-9567).

      - This pull request prevents Flink from holding on to excess containers when the YARN
callback onContainerCompleted is invoked after a full restart.
    ## Brief change log
      - Modify the onContainerCompleted method in YarnResourceManager.
      - Add getNumberPendingSlotRequests to SlotManager, which reports how many pending slot
requests have not yet been fulfilled.
      - Add getNumberPendingSlotRequests to ResourceManager, which gets the number of pending slot
requests from SlotManager.
    ## Verifying this change
    This change is covered by testOnContainerCompleted, added in YarnResourceManagerTest.
    ## Does this pull request potentially affect one of the following parts:
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing,
Yarn/Mesos, ZooKeeper: (yes)
      - The S3 file system connector: (no)
    ## Documentation
      - Does this pull request introduce a new feature? (No)
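The change log above can be illustrated with a small model. This is a minimal sketch, not the actual patch, which is not shown in this message. The class `GuardedResourceManagerSketch` is hypothetical, the pending slot count is a plain integer standing in for SlotManager, and the exact guard condition is an assumption about how `getNumberPendingSlotRequests` might be used inside `onContainerCompleted`.

```java
/** Hypothetical model of the guarded bookkeeping; these are not Flink's real classes. */
final class GuardedResourceManagerSketch {
    // Stand-in for what SlotManager would report as unfulfilled slot requests.
    private int pendingSlotRequests;
    // Mirrors the numPendingContainerRequests counter named in the issue.
    private int numPendingContainerRequests;
    private int heldContainers;

    GuardedResourceManagerSketch(int pendingSlotRequests) {
        this.pendingSlotRequests = pendingSlotRequests;
    }

    int getNumberPendingSlotRequests() {
        return pendingSlotRequests;
    }

    int getNumPendingContainerRequests() {
        return numPendingContainerRequests;
    }

    int getHeldContainers() {
        return heldContainers;
    }

    /** Callback for a finished container, possibly arriving late after a full restart. */
    void onContainerCompleted() {
        // Assumed guard: only request a replacement while there are more
        // unfulfilled slot requests than container requests already in flight.
        if (getNumberPendingSlotRequests() > numPendingContainerRequests) {
            numPendingContainerRequests++;
        }
    }

    /** YARN delivers a container. Returns true if it is kept, false if it is returned as excess. */
    boolean onContainerAllocated() {
        if (numPendingContainerRequests > 0) {
            numPendingContainerRequests--;
            heldContainers++;
            pendingSlotRequests = Math.max(0, pendingSlotRequests - 1);
            return true;
        }
        return false;
    }
}
```

With no pending slot requests, a late `onContainerCompleted` leaves the counter at zero, so a container that YARN delivers afterwards is treated as excess and returned. With one pending slot request, the same callback still requests a replacement.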

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Clarkkkkk/flink master

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6237
commit 92086f1c56d9d170619fae170aed092e075c7c63
Author: yangshimin <yangshimin@...>
Date:   2018-07-02T03:56:00Z

    [FLINK-9567][runtime][yarn] Fix the bug that Flink does not release Yarn container when
onContainerCompleted callback method happened after full restart


> Flink does not release resource in Yarn Cluster mode
> ----------------------------------------------------
>                 Key: FLINK-9567
>                 URL: https://issues.apache.org/jira/browse/FLINK-9567
>             Project: Flink
>          Issue Type: Bug
>          Components: Cluster Management, YARN
>    Affects Versions: 1.5.0
>            Reporter: Shimin Yang
>            Assignee: Shimin Yang
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.6.0
>         Attachments: FlinkYarnProblem, fulllog.txt
> After restarting the JobManager in YARN cluster mode, Flink sometimes does not release
TaskManager containers. In the worst case, a job configured with 5 TaskManagers ended up
holding more than 100 containers. Although the job did not fail, it affected other jobs
in the YARN cluster.
> In the first log I posted, the container with id 24 is the reason why the resources were not
released. The container was killed before the restart, but Flink had not yet received the
*onContainerComplete* callback in *YarnResourceManager*, which should be invoked by YARN's
*AMRMAsyncClient*. After the restart, as we can see in line 347 of the FlinkYarnProblem log:
> 2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor - Association with
remote system [akka.tcp://flink@bd-r1hdp69:30609] has failed, address is now gated for [50]
ms. Reason: [Disassociated]
> Flink lost the connection to container 24, which is on machine bd-r1hdp69. When it tried
to call *closeTaskManagerConnection* in *onContainerComplete*, it no longer had a connection
to the TaskManager on container 24, so it simply ignored the request to close the TaskManager connection.
> 2018-06-14 22:50:51,812 DEBUG org.apache.flink.yarn.YarnResourceManager - No open TaskExecutor
connection container_1528707394163_29461_02_000024. Ignoring close TaskExecutor connection.
> However, before calling *closeTaskManagerConnection*, it had already called *requestYarnContainer*, which
increased the *numPendingContainerRequests* variable in *YarnResourceManager* by 1.
> Because the return of excess containers is decided by the *numPendingContainerRequests* variable
in *YarnResourceManager*, Flink cannot return this container even though it is not needed. Meanwhile,
the restart logic has already allocated enough containers for the TaskManagers, so Flink holds
the extra container for a long time for nothing.
> In the full log, the job ended with 7 containers while only 3 were running TaskManagers.
> PS: Another strange thing I found is that a request for a YARN container sometimes
returns many more containers than requested. Is this normal for AMRMAsyncClient?
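The counter behaviour described in the issue can be reproduced with a toy model. This is a sketch under stated assumptions, not Flink code: the class `LeakyContainerCounterSketch` and its `onContainerAllocated` and `getHeldContainers` methods are hypothetical, and only the names `requestYarnContainer`, `onContainerCompleted`, and `numPendingContainerRequests` come from the report.

```java
/** Hypothetical model of the leaky bookkeeping described in the issue; not Flink's real classes. */
final class LeakyContainerCounterSketch {
    // Mirrors the numPendingContainerRequests counter named in the issue.
    private int numPendingContainerRequests;
    private int heldContainers;

    void requestYarnContainer() {
        numPendingContainerRequests++;
    }

    /** Late callback for a container that was killed before the restart. */
    void onContainerCompleted() {
        // The replacement is requested unconditionally, even though the
        // restart logic has already asked for enough containers.
        requestYarnContainer();
    }

    /** YARN delivers a container. Returns true if it is kept, false if it is returned as excess. */
    boolean onContainerAllocated() {
        if (numPendingContainerRequests > 0) {
            numPendingContainerRequests--;
            heldContainers++;
            return true;
        }
        return false;
    }

    int getHeldContainers() {
        return heldContainers;
    }

    public static void main(String[] args) {
        int needed = 3;
        LeakyContainerCounterSketch rm = new LeakyContainerCounterSketch();
        for (int i = 0; i < needed; i++) {
            rm.requestYarnContainer(); // restart logic requests what the job needs
        }
        rm.onContainerCompleted();     // stale callback bumps the counter once more
        for (int i = 0; i < needed + 1; i++) {
            rm.onContainerAllocated(); // YARN answers every outstanding request
        }
        System.out.println("held=" + rm.getHeldContainers() + ", needed=" + needed);
        // prints "held=4, needed=3"
    }
}
```

In this model the fourth container is kept because the counter was still positive when it arrived, which matches the report that the excess container could not be returned.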

This message was sent by Atlassian JIRA
