tez-dev mailing list archives

From Bikas Saha <bi...@hortonworks.com>
Subject RE: Deadlock in DAGAppMaster during shutdown.
Date Wed, 11 Jun 2014 16:39:43 GMT
If steps 6 and 7 happen before the Tez AM shuts down, the AM will not exit
for a long time. Shutting down the mini cluster shuts down the NM but may
not shut down the AM. After cleanup, the AM tries to unregister from the RM
before shutting itself down. Since the RM is already gone, the unregister
call keeps retrying for some time (the retries exist for the high
availability case, where the RM may have just crashed and will come back
up). So you will see the AM process hanging around for a while.
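
To illustrate why the AM lingers, here is a minimal sketch of a bounded retry
loop. It is not the Hadoop retry code; the names RmCall and callWithRetries,
and the attempt count and sleep interval, are assumptions made up for this
example. The point is only that a caller retrying against an RM that never
comes back stays alive for roughly (attempts x sleep interval) before giving up.

```java
import java.io.IOException;

public class UnregisterRetrySketch {
    /** Hypothetical stand-in for an RPC to the ResourceManager. */
    public interface RmCall {
        void invoke() throws IOException;
    }

    /**
     * Invokes the call, sleeping between failed attempts.
     * Returns the number of attempts used on success and rethrows
     * the last failure once maxAttempts is exhausted.
     */
    public static int callWithRetries(RmCall call, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                call.invoke();
                return attempt;
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis); // the process stays alive while waiting
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulates an RM that has already been shut down (assumed behaviour).
        RmCall deadRm = () -> { throw new IOException("connection refused"); };
        long start = System.nanoTime();
        try {
            callWithRetries(deadRm, 5, 100);
        } catch (IOException e) {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("gave up after " + elapsedMs + " ms: " + e.getMessage());
        }
    }
}
```

With realistic retry settings the waiting period is much longer than in this
toy run, which would match an AM that appears to hang after the cluster is gone.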



You can confirm this by checking, while the AM is hanging around, whether
the NM and RM processes are gone, and by looking for this message in the AM
logs: “Waiting for application to be successful”



Bikas



*From:* Subroto Sanyal [mailto:sanyalsubroto@gmail.com]
*Sent:* Wednesday, June 11, 2014 1:52 AM
*To:* dev@tez.incubator.apache.org
*Subject:* Re: Deadlock in DAGAppMaster during shutdown.



Hi Bikas, Hitesh



TezSession.stop() is invoked as part of my client flow.

Order of execution:

1) Create MiniTezCluster

2) Create Tez Session

3) Create DAG

4) Submit DAG to Tez Session and wait for completion

5) Repeat step 4 for different DAGs

6) Stop Tez Session

7) Stop MiniTezCluster



PFA the container logs and thread-dump of DAGAppMaster



On Wed, Jun 11, 2014 at 1:23 AM, Bikas Saha <bikas@hortonworks.com> wrote:

Can you please clarify how the TezSession is stopped? Has TezSession.stop()
been called? If not, the session app on the cluster will not stop. It will
stop after it has been idle (no DAG running) for a configurable timeout period.

If TezSession.stop() has been called, the AM might keep running while it
cleans up existing running tasks etc., and then exit when this cleanup is
done. TezSession.stop() does not block on the client, so the method can
return before the app exits.
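
Because stop() returns before the app exits, a client that tears down the
cluster right afterwards may want to wait for the AM first. Below is a minimal
sketch of such a wait, assuming the caller can supply some check for "the
application has finished" (for example by asking YARN for the application
state). The class AwaitAppExitSketch and the method awaitExit are hypothetical
and not part of Tez.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class AwaitAppExitSketch {
    /**
     * Polls appFinished until it returns true or the timeout elapses.
     * Returns true if the application was seen finished, false on timeout.
     */
    public static boolean awaitExit(BooleanSupplier appFinished,
                                    long timeoutMillis, long pollMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!appFinished.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out; the app may still be running
            }
            Thread.sleep(pollMillis);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        // Fake application that reports "finished" on the third poll.
        AtomicInteger polls = new AtomicInteger();
        BooleanSupplier fakeApp = () -> polls.incrementAndGet() >= 3;

        // The session would be stopped here (step 6) ...
        boolean exited = awaitExit(fakeApp, 5_000, 10);
        // ... and the mini cluster stopped (step 7) only after the wait.
        System.out.println("AM exited before cluster stop: " + exited);
    }
}
```

The timeout keeps a test from hanging forever if the AM never exits; what to do
on timeout (fail the test, kill the app) is left to the caller.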

Bikas


-----Original Message-----
From: Subroto Sanyal [mailto:sanyalsubroto@gmail.com]
Sent: Tuesday, June 10, 2014 3:27 PM
To: dev@tez.incubator.apache.org
Subject: Re: Deadlock in DAGAppMaster during shutdown.

Hi,

I built the Tez jars from the git repository today; still, I see the
DAGAppMaster running even after the TezSession is stopped.
Do I need to get the code/jar from somewhere else to pick up the fix?


On Tue, Jun 10, 2014 at 1:54 PM, Subroto Sanyal <sanyalsubroto@gmail.com>
wrote:

> Hi Oleg,
>
>
> Thanks for confirming. Could you please provide the TEZ JIRA tickets
> under which the two issues have been fixed?
> I couldn't find the code changes for closing TezClient.
>
>
> On Tue, Jun 10, 2014 at 1:25 PM, Oleg Zhurakousky <
> ozhurakousky@hortonworks.com> wrote:
>
>> Subroto
>>
>> Thanks for pointing this out.
>> This and the TezClient issue you’ve pointed out in your previous
>> email are actively being addressed.
>>
>> Oleg
>>
>> On Jun 10, 2014, at 5:42 AM, Subroto Sanyal <sanyalsubroto@gmail.com>
>> wrote:
>>
>> > In the class AMRMClientAsyncImpl, the object (7c3041e28) is locked by
>> > the heartbeat thread (which, like any heartbeat thread, runs a more or
>> > less infinite loop), and the method unregisterApplicationMaster is
>> > waiting to lock the same object.
>> >
>> > Only once unregisterApplicationMaster acquires that lock can it tell
>> > the heartbeat thread to exit, via the boolean flag keepRunning.
>> >
>> > Following is the thread-dump for the deadlock:
>> >
>> > "AMShutdownThread" daemon prio=5 tid=7f9a02921800 nid=0x115d68000 waiting for monitor entry [115d67000]
>> >    java.lang.Thread.State: BLOCKED (on object monitor)
>> >     at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl.unregisterApplicationMaster(AMRMClientAsyncImpl.java:156)
>> >     - waiting to lock <7c3041e28> (a java.lang.Object)
>> >     at org.apache.tez.dag.app.rm.TaskScheduler.serviceStop(TaskScheduler.java:394)
>> >     - locked <7c3006aa0> (a org.apache.tez.dag.app.rm.TaskScheduler)
>> >     at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>> >     - locked <7c3038008> (a java.lang.Object)
>> >     at org.apache.tez.dag.app.rm.TaskSchedulerEventHandler.serviceStop(TaskSchedulerEventHandler.java:357)
>> >     at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>> >     - locked <7c2f71360> (a java.lang.Object)
>> >     at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>> >     at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>> >     at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1518)
>> >     at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:1649)
>> >     - locked <7c2f51790> (a org.apache.tez.dag.app.DAGAppMaster)
>> >     at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>> >     - locked <7c2fed728> (a java.lang.Object)
>> >     at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHandler$AMShutdownRunnable.run(DAGAppMaster.java:607)
>> >     at java.lang.Thread.run(Thread.java:695)
>> >
>> > "AMRM Heartbeater thread" prio=5 tid=7f9a0c0e8800 nid=0x111e70000 waiting on condition [111e6f000]
>> >    java.lang.Thread.State: TIMED_WAITING (sleeping)
>> >     at java.lang.Thread.sleep(Native Method)
>> >     at org.apache.hadoop.util.ThreadUtil.sleepAtLeastIgnoreInterrupts(ThreadUtil.java:43)
>> >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:150)
>> >     at com.sun.proxy.$Proxy9.allocate(Unknown Source)
>> >     at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:246)
>> >     at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
>> >     - locked <7c3041e28> (a java.lang.Object)
>> >
>> > public void unregisterApplicationMaster(FinalApplicationStatus appStatus,
>> >     String appMessage, String appTrackingUrl) throws YarnException,
>> >     IOException {
>> >   synchronized (unregisterHeartbeatLock) {
>> >     keepRunning = false;
>> >     client.unregisterApplicationMaster(appStatus, appMessage,
>> >         appTrackingUrl);
>> >   }
>> > }
>> >
>> >
>> > The line "keepRunning = false" should be outside the synchronized
>> > block.
>> >
>> > I am not sure whether this should be regarded as a problem in YARN or
>> > in Tez. The flag is private and can't be accessed by the Tez
>> > implementation, TezAMRMClientAsync.
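
The locking pattern described above can be reproduced in a small,
self-contained model. This is a simplified sketch, not the Hadoop source:
only the names keepRunning and unregisterHeartbeatLock come from the quoted
code, everything else (the class, heartbeatLoop, stopInsideLock,
stopOutsideLock, the latches) is made up for illustration. It shows that while
the heartbeat thread holds the lock inside a long-running call, a stop request
that sets the flag inside the synchronized block is not even visible yet,
whereas setting the flag first makes the request visible immediately. In both
variants the caller still has to wait for the lock before it can unregister.

```java
import java.util.concurrent.CountDownLatch;

public class HeartbeatLockSketch {
    private final Object unregisterHeartbeatLock = new Object();
    private volatile boolean keepRunning = true;

    public boolean isKeepRunning() {
        return keepRunning;
    }

    /** Heartbeat loop: holds the lock for the whole duration of each allocate call. */
    public void heartbeatLoop(Runnable allocate) throws InterruptedException {
        while (true) {
            synchronized (unregisterHeartbeatLock) {
                if (!keepRunning) {
                    return;
                }
                allocate.run(); // may block for a long time if the RM is gone
            }
            Thread.sleep(10);
        }
    }

    /** Pattern from the quoted code: the flag is set only once the lock is held. */
    public void stopInsideLock(Runnable unregister) {
        synchronized (unregisterHeartbeatLock) {
            keepRunning = false;
            unregister.run();
        }
    }

    /** Suggested variant: flip the flag first, then take the lock to unregister. */
    public void stopOutsideLock(Runnable unregister) {
        keepRunning = false;
        synchronized (unregisterHeartbeatLock) {
            unregister.run();
        }
    }

    public static void main(String[] args) throws Exception {
        HeartbeatLockSketch sketch = new HeartbeatLockSketch();
        CountDownLatch inAllocate = new CountDownLatch(1);
        CountDownLatch rmResponds = new CountDownLatch(1);

        Thread heartbeat = new Thread(() -> {
            try {
                sketch.heartbeatLoop(() -> {
                    inAllocate.countDown();
                    try {
                        rmResponds.await(); // simulated slow allocate call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "heartbeat");
        heartbeat.start();
        inAllocate.await(); // heartbeat now holds the lock

        Thread shutdown = new Thread(() -> sketch.stopOutsideLock(() -> { }), "shutdown");
        shutdown.start();
        while (shutdown.getState() != Thread.State.BLOCKED) {
            Thread.sleep(5);
        }
        System.out.println("shutdown thread: " + shutdown.getState()
                + ", keepRunning=" + sketch.isKeepRunning());

        rmResponds.countDown(); // the simulated call finally returns
        shutdown.join();
        heartbeat.join();
    }
}
```

Note that neither variant interrupts an allocate call that is already in
progress; the model only shows when the stop request becomes visible.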
>> >
>> >
>> > --
>> > Cheers,
>> > *Subroto Sanyal*
>>
>>
>> --
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or
>> entity to which it is addressed and may contain information that is
>> confidential, privileged and exempt from disclosure under applicable
>> law. If the reader of this message is not the intended recipient, you
>> are hereby notified that any printing, copying, dissemination,
>> distribution, disclosure or forwarding of this communication is
>> strictly prohibited. If you have received this communication in
>> error, please contact the sender immediately and delete it from your
>> system. Thank You.
>>
>
> --
> Cheers,
> *Subroto Sanyal*



--
Cheers,
*Subroto Sanyal*






-- 
Cheers,
*Subroto Sanyal*

