hadoop-yarn-dev mailing list archives

From "Haibo Chen (Jira)" <j...@apache.org>
Subject [jira] [Created] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
Date Mon, 19 Oct 2020 21:17:00 GMT
Haibo Chen created YARN-10467:
---------------------------------

             Summary: ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
                 Key: YARN-10467
                 URL: https://issues.apache.org/jira/browse/YARN-10467
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.10.0
            Reporter: Haibo Chen
            Assignee: Haibo Chen


In a recent heap analysis, we found that the majority of the heap is occupied by {{RMNodeImpl.completedContainers}}<ContainerIdPBImpl>,
which accounts for 19 GB out of 24.3 GB. There are over 86 million ContainerIdPBImpl objects;
in contrast, there are only 161,601 RMContainerImpl objects, which represent the number of active containers
that the RM is still tracking. The ContainerIdPBImpl objects we inspected belong to applications
that finished long ago. This indicates a memory leak of ContainerIdPBImpl objects
in RMNodeImpl.

 

Right now, when a container is reported by an NM as completed, it is immediately added to RMNodeImpl.completedContainers
and later cleaned up after the AM has been notified of its completion in the AM-RM heartbeat.
The cleanup can be broken into a few steps.
 * Step 1: The completed container is first added to RMAppAttemptImpl.justFinishedContainers
(this is asynchronous to its being added to {{RMNodeImpl.completedContainers}}).
 * Step 2: During the AM-RM heartbeat, the container is removed from RMAppAttemptImpl.justFinishedContainers
and added to RMAppAttemptImpl.finishedContainersSentToAM.

Once a completed container gets added to RMAppAttemptImpl.finishedContainersSentToAM, it
is guaranteed to be cleaned up from {{RMNodeImpl.completedContainers}}.

 

However, if the AM exits (whether it failed or succeeded) before some recently completed
containers have been added to RMAppAttemptImpl.finishedContainersSentToAM by earlier heartbeats,
there won’t be any future AM-RM heartbeat to perform step 2 above. Hence, these
objects stay in RMNodeImpl.completedContainers forever.
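
The sequence above can be illustrated with a small standalone model. This is only a sketch, not the actual ResourceManager code: the class {{CompletedContainersLeakSketch}} and its methods are hypothetical, container IDs are plain strings, and step 1 is modelled synchronously even though it is asynchronous in the RM. Only the three collection names come from this report.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, heavily simplified model of the bookkeeping described above.
public class CompletedContainersLeakSketch {
    // Stand-in for RMNodeImpl.completedContainers
    final Set<String> completedContainers = new HashSet<>();
    // Stand-ins for the two RMAppAttemptImpl collections
    final List<String> justFinishedContainers = new ArrayList<>();
    final List<String> finishedContainersSentToAM = new ArrayList<>();

    // NM reports a completed container: tracked on the node and (step 1) on the attempt.
    // Assumption: done in one call here; the real RM does this asynchronously.
    void containerCompleted(String containerId) {
        completedContainers.add(containerId);
        justFinishedContainers.add(containerId);
    }

    // Step 2: an AM-RM heartbeat hands the finished containers to the AM.
    void amHeartbeat() {
        finishedContainersSentToAM.addAll(justFinishedContainers);
        justFinishedContainers.clear();
    }

    // Cleanup: only containers already sent to the AM are removed from the node.
    void cleanUpContainersSentToAM() {
        completedContainers.removeAll(finishedContainersSentToAM);
        finishedContainersSentToAM.clear();
    }

    // Runs one normal completion and one completion that races with AM exit.
    static Set<String> simulate() {
        CompletedContainersLeakSketch rm = new CompletedContainersLeakSketch();

        rm.containerCompleted("container_1");
        rm.amHeartbeat();               // AM learns about container_1
        rm.cleanUpContainersSentToAM(); // container_1 leaves completedContainers

        rm.containerCompleted("container_2");
        // The AM exits here, so amHeartbeat() is never called again.
        rm.cleanUpContainersSentToAM(); // nothing to clean: container_2 was never sent

        return rm.completedContainers;
    }

    public static void main(String[] args) {
        System.out.println(simulate()); // prints [container_2]
    }
}
```

In this model, container_2 never reaches finishedContainersSentToAM, so no cleanup path ever removes it from completedContainers.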

We have observed in MR that AMs can decide to exit upon success of all their tasks without waiting
for notification of the completion of every container, or an AM may just die suddenly (e.g. OOM).
Spark and other frameworks may behave similarly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

