tez-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Subroto Sanyal <sanyalsubr...@gmail.com>
Subject Re: Deadlock in DAGAppMaster during shutdown.
Date Wed, 11 Jun 2014 08:51:59 GMT
Hi Bikas, Hitesh

The Tezsession.stop() is invoked as part of my Client flow.
order of execution:
1) Create MiniTezCluster
2) Create Tez Session
3) Create DAG
4) Submit DAG to Tez Session and wait for completion
5) Repeat step 4 for different DAGs
6) Stop Tez Session
7) Stop MiniTezCluster

PFA the container logs and thread-dump of DAGAppMaster


On Wed, Jun 11, 2014 at 1:23 AM, Bikas Saha <bikas@hortonworks.com> wrote:

> Can you please clarify TezSession is stopped? Has TezSession.stop() been
> called? If not then the session app on the cluster will not stop. It will
> stop after its been idle (no DAG running) for a configurable timeout
> period.
>
> If TezSession.stop() has been called then the AM might keep running and
> clean up existing running tasks etc. Then exit when this cleanup is done.
> TezSession.stop() is not blocking on the client. So the method can return
> before the app exits.
>
> Bikas
>
> -----Original Message-----
> From: Subroto Sanyal [mailto:sanyalsubroto@gmail.com]
> Sent: Tuesday, June 10, 2014 3:27 PM
> To: dev@tez.incubator.apache.org
> Subject: Re: Deadlock in DAGAppMaster during shutdown.
>
> Hi,
>
> I have build  the Tez jars from the git repository today; still, I see the
> DAGAppMaster running even after the TezSession is stopped.
> Do I need to get the code/jar from somewhere else to get the fix reflected?
>
>
> On Tue, Jun 10, 2014 at 1:54 PM, Subroto Sanyal <sanyalsubroto@gmail.com>
> wrote:
>
> > Hi Oleg,
> >
> >
> > Thanks for confirming. Could you please provide the TEZ jira tickets
> > for both of the issue where they have been solved.
> > I couldn't find the code changes for closing TezClient.
> >
> >
> > On Tue, Jun 10, 2014 at 1:25 PM, Oleg Zhurakousky <
> > ozhurakousky@hortonworks.com> wrote:
> >
> >> Subroto
> >>
> >> Thanks for pointing this out.
> >> This and the TezClient issue you’ve pointed out in your previous
> >> email is actually being actively addressed
> >>
> >> Oleg
> >>
> >> On Jun 10, 2014, at 5:42 AM, Subroto Sanyal <sanyalsubroto@gmail.com>
> >> wrote:
> >>
> >> > In the class AMRMClientAsyncImpl the object(7c3041e28) is being
> >> > locked
> >> by
> >> > Heartbeat thread(which kinds of run a infinite loop as any
> >> > heartbeat
> >> > thread) which is requested to be locked by the method
> >> > unregisterApplicationMaster.
> >> >
> >> > Once the method unregisterApplicationMaster can lock the requested
> >> object;
> >> > then only it can notify the heartbeat thread to exit by a boolean
> >> > flag keepRunning.
> >> >
> >> > Following is the thread-dump for the deadlock:
> >> >
> >> > "AMShutdownThread" daemon prio=5 tid=7f9a02921800 nid=0x115d68000
> >> waiting
> >> > for monitor entry [115d67000]
> >> >
> >> >   java.lang.Thread.State: BLOCKED (on object monitor)
> >> >
> >> > at
> >> >
> >> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl.unre
> >> gisterApplicationMaster(AMRMClientAsyncImpl.java:156)
> >> >
> >> > - waiting to lock <7c3041e28> (a java.lang.Object)
> >> >
> >> > at
> >> >
> >> org.apache.tez.dag.app.rm.TaskScheduler.serviceStop(TaskScheduler.jav
> >> a:394)
> >> >
> >> > - locked <7c3006aa0> (a org.apache.tez.dag.app.rm.TaskScheduler)
> >> >
> >> > at
> >> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:2
> >> 21)
> >> >
> >> > - locked <7c3038008> (a java.lang.Object)
> >> >
> >> > at
> >> >
> >> org.apache.tez.dag.app.rm.TaskSchedulerEventHandler.serviceStop(TaskS
> >> chedulerEventHandler.java:357)
> >> >
> >> > at
> >> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:2
> >> 21)
> >> >
> >> > - locked <7c2f71360> (a java.lang.Object)
> >> >
> >> > at
> >> >
> >> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.ja
> >> va:52)
> >> >
> >> > at
> >> >
> >> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperat
> >> ions.java:80)
> >> >
> >> > at
> >> org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:15
> >> 18)
> >> >
> >> > at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:
> >> 1649)
> >> >
> >> > - locked <7c2f51790> (a org.apache.tez.dag.app.DAGAppMaster)
> >> >
> >> > at
> >> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:2
> >> 21)
> >> >
> >> > - locked <7c2fed728> (a java.lang.Object)
> >> >
> >> > at
> >> >
> >> org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHandler$AMShu
> >> tdownRunnable.run(DAGAppMaster.java:607)
> >> >
> >> > at java.lang.Thread.run(Thread.java:695)
> >> >
> >> >
> >> > "AMRM Heartbeater thread" prio=5 tid=7f9a0c0e8800 nid=0x111e70000
> >> waiting
> >> > on condition [111e6f000]
> >> >
> >> >   java.lang.Thread.State: TIMED_WAITING (sleeping)
> >> >
> >> > at java.lang.Thread.sleep(Native Method)
> >> >
> >> > at
> >> >
> >> org.apache.hadoop.util.ThreadUtil.sleepAtLeastIgnoreInterrupts(Thread
> >> Util.java:43)
> >> >
> >> > at
> >> >
> >> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocat
> >> ionHandler.java:150)
> >> >
> >> > at com.sun.proxy.$Proxy9.allocate(Unknown Source)
> >> >
> >> > at
> >> >
> >> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMCl
> >> ientImpl.java:246)
> >> >
> >> > at
> >> >
> >> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$Hear
> >> tbeatThread.run(AMRMClientAsyncImpl.java:224)
> >> >
> >> > - locked <7c3041e28> (a java.lang.Object)
> >> >
> >> > *public void unregisterApplicationMaster(FinalApplicationStatus
> >> appStatus,*
> >> >
> >> > *      String appMessage, String appTrackingUrl) throws
> YarnException,*
> >> >
> >> > *      IOException {*
> >> >
> >> > *    synchronized (unregisterHeartbeatLock) {*
> >> >
> >> > *      keepRunning = false;*
> >> >
> >> > *      client.unregisterApplicationMaster(appStatus, appMessage,
> >> > appTrackingUrl);*
> >> >
> >> > *    }*
> >> >
> >> > *  }*
> >> >
> >> >
> >> > The line "keepRunning = false" should be outside the synchronized
> >> > block.
> >> >
> >> > I am not sure this should be regarded as problem in yarn or TEZ.
> >> > The
> >> flag
> >> > is private and can't be accessed by Tez implementation
> >> TezAMRMClientAsync.
> >> >
> >> >
> >> > --
> >> > Cheers,
> >> > *Subroto Sanyal*
> >>
> >>
> >> --
> >> CONFIDENTIALITY NOTICE
> >> NOTICE: This message is intended for the use of the individual or
> >> entity to which it is addressed and may contain information that is
> >> confidential, privileged and exempt from disclosure under applicable
> >> law. If the reader of this message is not the intended recipient, you
> >> are hereby notified that any printing, copying, dissemination,
> >> distribution, disclosure or forwarding of this communication is
> >> strictly prohibited. If you have received this communication in
> >> error, please contact the sender immediately and delete it from your
> >> system. Thank You.
> >>
> >
> >
> >
> > --
> > Cheers,
> > *Subroto Sanyal*
> >
>
>
>
> --
> Cheers,
> *Subroto Sanyal*
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>



-- 
Cheers,
*Subroto Sanyal*

Mime
View raw message