From issues-return-174346-apmail-flink-issues-archive=flink.apache.org@flink.apache.org Mon Jul 2 11:21:04 2018
Date: Mon, 2 Jul 2018 11:21:00 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: issues@flink.apache.org
Reply-To: dev@flink.apache.org
Subject: [jira] [Commented] (FLINK-9567) Flink does not release resource in Yarn Cluster mode
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

    [ https://issues.apache.org/jira/browse/FLINK-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529722#comment-16529722 ]

ASF GitHub Bot commented on FLINK-9567:
---------------------------------------

Github user Clarkkkkk commented on the issue:

    https://github.com/apache/flink/pull/6237

    @tillrohrmann Hi Till, I fixed this bug according to the idea you mentioned in our last conversation.
> Flink does not release resource in Yarn Cluster mode
> -----------------------------------------------------
>
>                 Key: FLINK-9567
>                 URL: https://issues.apache.org/jira/browse/FLINK-9567
>             Project: Flink
>          Issue Type: Bug
>          Components: Cluster Management, YARN
>    Affects Versions: 1.5.0
>            Reporter: Shimin Yang
>            Assignee: Shimin Yang
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.6.0
>
>         Attachments: FlinkYarnProblem, fulllog.txt
>
>
> After restarting the Job Manager in Yarn Cluster mode, Flink sometimes does not release task manager containers in certain cases. In the worst case, a job configured for 5 task managers ended up holding more than 100 containers. Although the job itself did not fail, it affected other jobs in the Yarn cluster.
> In the first log I posted, the container with id 24 is the reason why Yarn did not release resources. The container was killed before the restart, but *YarnResourceManager* never received the *onContainerComplete* callback, which should be invoked by Yarn's *AMRMClientAsync*. After the restart, as we can see in line 347 of the FlinkYarnProblem log,
> 2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@bd-r1hdp69:30609] has failed, address is now gated for [50] ms. Reason: [Disassociated]
> Flink lost the connection to container 24, which runs on the bd-r1hdp69 machine. When it tried to call *closeTaskManagerConnection* in *onContainerComplete*, it no longer had a connection to the TaskManager on container 24, so it simply ignored the close:
> 2018-06-14 22:50:51,812 DEBUG org.apache.flink.yarn.YarnResourceManager - No open TaskExecutor connection container_1528707394163_29461_02_000024. Ignoring close TaskExecutor connection.
> However, before calling *closeTaskManagerConnection*, it had already called *requestYarnContainer*, which increased the *numPendingContainerRequests* variable in *YarnResourceManager* by 1.
> Because the decision to return an excess container is based on *numPendingContainerRequests*, this container cannot be returned even though it is not needed. Meanwhile, the restart logic has already allocated enough containers for the Task Managers, so Flink holds the extra container for a long time for nothing (see the sketch after this quoted description).
> In the full log, the job ended with 7 containers while only 3 were running TaskManagers.
> ps: Another strange thing I noticed is that a request for a yarn container sometimes returns many more containers than requested. Is this a normal scenario for AMRMClientAsync?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
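A minimal Java sketch of the container-accounting flow the issue describes, assuming a heavily simplified resource manager: the names onContainerCompleted, requestYarnContainer, and numPendingContainerRequests follow the issue text, but the class, fields, signatures, and main method are illustrative assumptions, not the actual org.apache.flink.yarn.YarnResourceManager source.

    import java.util.HashMap;
    import java.util.Map;

    public class YarnResourceManagerSketch {

        // Counter used to decide whether a newly allocated container is still
        // needed (name taken from the issue text).
        private int numPendingContainerRequests = 0;

        // Open TaskExecutor connections, keyed by container id (simplified).
        private final Map<String, Object> taskExecutorConnections = new HashMap<>();

        // Invoked asynchronously by the YARN client when a container completes.
        public void onContainerCompleted(String containerId) {
            // A replacement container is requested unconditionally ...
            requestYarnContainer(); // numPendingContainerRequests is now +1

            // ... but if the TaskManager connection was already lost (e.g. across
            // a JobManager restart), closing it is a no-op, as in the DEBUG log above.
            Object connection = taskExecutorConnections.remove(containerId);
            if (connection == null) {
                System.out.println("No open TaskExecutor connection " + containerId
                        + ". Ignoring close TaskExecutor connection.");
                // The pending-request counter was still incremented, so the extra
                // container YARN later allocates is treated as "requested" and is
                // never given back: the leak described in this issue.
            }
        }

        private void requestYarnContainer() {
            numPendingContainerRequests++;
            // The real resource manager would also ask the YARN AM/RM client
            // for a new container here.
        }

        // When YARN allocates a container, it is only returned as excess
        // if nothing is pending.
        public void onContainerAllocated(String containerId) {
            if (numPendingContainerRequests > 0) {
                numPendingContainerRequests--;
                taskExecutorConnections.put(containerId, new Object()); // start a TaskExecutor (stub)
            } else {
                System.out.println("Returning excess container " + containerId);
            }
        }

        public static void main(String[] args) {
            YarnResourceManagerSketch rm = new YarnResourceManagerSketch();
            // Container 24 completes after its connection was already lost:
            rm.onContainerCompleted("container_..._000024");
            // YARN then hands over another container; because the counter was
            // incremented above, it is kept even though it is not needed.
            rm.onContainerAllocated("container_..._000025");
        }
    }

Running the main method shows the counter being incremented for a container whose connection is already gone, which is why the later surplus container is counted as "pending" and never returned to YARN.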