hadoop-yarn-dev mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: Question about YARN NodeManager and ApplicationMaster failures
Date Thu, 03 Mar 2016 13:16:40 GMT

> On 3 Mar 2016, at 12:58, Dustin Cote <dcote@cloudera.com> wrote:
> -dev since this is more of a user question
> The NodeManager is the parent for the application master, so any containers
> (including application master containers) that are running where the failed
> NodeManager is located will die.  If an application master fails, then a
> new one is created up to your limit (set by
> yarn.resourcemanager.am.max-attempts).  The other containers associated
> with the application master are supposed to continue on and pick up the
> newly started application master.  

Only if you tell YARN to keep containers across AM restarts and the AM has the code to rebuild
its state. Most AMs don't do this (MR, Tez, Spark, etc.), as the state is hard to preserve
and rebuild.
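
To make the distinction concrete, here is a toy Python model of what happens to an application's containers when its AM fails. This is not YARN code: the class, field, and method names are made up for illustration. The real switch is set per application at submission time (in the Java API, ApplicationSubmissionContext.setKeepContainersAcrossApplicationAttempts), and the new AM still has to rebuild its own state, which this sketch does not cover.

```python
from dataclasses import dataclass, field


@dataclass
class ToyApp:
    """Hypothetical stand-in for the RM's view of one application."""
    keep_containers: bool            # models "keep containers across attempts"
    max_attempts: int = 2            # models yarn.resourcemanager.am.max-attempts
    attempt: int = 1
    running: set = field(default_factory=set)
    state: str = "RUNNING"

    def am_failed(self) -> None:
        """Apply the effect of one AM failure to the application."""
        if self.attempt >= self.max_attempts:
            # No attempts left: the application fails and everything is cleaned up.
            self.state = "FAILED"
            self.running.clear()
            return
        self.attempt += 1
        if not self.keep_containers:
            # Default behaviour: the old attempt's containers are killed.
            self.running.clear()
        # Otherwise the containers keep running, and the new AM attempt
        # must work out what they were doing.


default_app = ToyApp(keep_containers=False, running={"c1", "c2"})
default_app.am_failed()

long_lived = ToyApp(keep_containers=True, running={"c1", "c2"})
long_lived.am_failed()
```

After one AM failure, default_app has no running containers left, while long_lived still has both and is on its second attempt.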

See YARN-896 for all the details of things related to long-lived services.

You can also put a reset window on AM failures: see YARN-611.
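
Roughly, the reset window means that only AM failures inside a recent time window count against the attempt limit. Here is a simplified Python sketch of that idea. It is not the RM's actual logic: the function name and parameters are hypothetical, and the exact boundary handling in YARN may differ.

```python
def app_should_fail(failure_times_ms, now_ms, max_attempts, validity_interval_ms=-1):
    """Return True if the counted AM failures reach max_attempts.

    failure_times_ms: timestamps (ms) at which earlier AM attempts failed.
    validity_interval_ms: window length; a non-positive value is treated
    here as "no window", so every failure is counted.
    """
    if validity_interval_ms > 0:
        # Only failures that happened within the window still count.
        counted = [t for t in failure_times_ms
                   if now_ms - t <= validity_interval_ms]
    else:
        counted = list(failure_times_ms)
    return len(counted) >= max_attempts


HOUR_MS = 60 * 60 * 1000
failures = [0, 10 * HOUR_MS]          # two AM failures, ten hours apart

without_window = app_should_fail(failures, 10 * HOUR_MS, max_attempts=2)
with_window = app_should_fail(failures, 10 * HOUR_MS, max_attempts=2,
                              validity_interval_ms=HOUR_MS)
```

With no window, the second failure exhausts a limit of two attempts. With a one-hour window, the failure from ten hours earlier has aged out, so a long-running service is not killed by failures spread over days.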

Oh, and there's work-preserving NM restart, but that's another topic...

> The resource manager takes care of the
> bookkeeping needed to make this happen.  I'd suggest you have a look at the
> series of blogs here
> <http://blog.cloudera.com/blog/2015/09/untangling-apache-hadoop-yarn-part-1/>
> for a more in-depth look at the mechanics.
> -Dustin
> On Wed, Mar 2, 2016 at 8:26 PM, Sadystio Ilmatunt <urkpostenardr@gmail.com>
> wrote:
>> Hello,
>> I have some questions regarding failure of NodeManager and Application
>> Master.
>> What happens if NodeManager which is running on the same node as
>> Application Master fails?
>> Does Application Master fail as well?
>> Also, how is Application Master failure handled with respect to its
>> (child) containers?
>> Do these containers fail too?
>> If Yes, is there a way these containers can be assigned to new
>> instance of application master that might come up on some other node?
> -- 
> Dustin Cote
> Customer Operations Engineer
> <http://www.cloudera.com>
