tez-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Oleg Zhurakousky <ozhurakou...@hortonworks.com>
Subject Re: Deadlock in DAGAppMaster during shutdown.
Date Wed, 11 Jun 2014 15:48:56 GMT
Subroto

Perhaps you can attach your client code.
I was just trying to reproduce the problem following your scenario (as written) and DAGAppMaster
shuts down successfully 

Oleg

On Jun 11, 2014, at 10:05 AM, Oleg Zhurakousky <ozhurakousky@hortonworks.com> wrote:

> Subroto
> 
> Sorry for not getting back to you earlier
> I don’t believe there is a dedicated JIRA, but I’ve seen this and other similar problems
and I’am working on it as we speak as part of a different issue. 
> I’ll follow up before the end of today (EST), but feel free to raise JIRA
> 
> Oleg
> 
> On Jun 11, 2014, at 4:51 AM, Subroto Sanyal <sanyalsubroto@gmail.com> wrote:
> 
>> Hi Bikas, Hitesh
>> 
>> The Tezsession.stop() is invoked as part of my Client flow.
>> order of execution:
>> 1) Create MiniTezCluster
>> 2) Create Tez Session
>> 3) Create DAG
>> 4) Submit DAG to Tez Session and wait for completion
>> 5) Repeat step 4 for different DAGs
>> 6) Stop Tez Session
>> 7) Stop MiniTezCluster
>> 
>> PFA the container logs and thread-dump of DAGAppMaster
>> 
>> 
>> On Wed, Jun 11, 2014 at 1:23 AM, Bikas Saha <bikas@hortonworks.com> wrote:
>> Can you please clarify TezSession is stopped? Has TezSession.stop() been
>> called? If not then the session app on the cluster will not stop. It will
>> stop after its been idle (no DAG running) for a configurable timeout period.
>> 
>> If TezSession.stop() has been called then the AM might keep running and
>> clean up existing running tasks etc. Then exit when this cleanup is done.
>> TezSession.stop() is not blocking on the client. So the method can return
>> before the app exits.
>> 
>> Bikas
>> 
>> -----Original Message-----
>> From: Subroto Sanyal [mailto:sanyalsubroto@gmail.com]
>> Sent: Tuesday, June 10, 2014 3:27 PM
>> To: dev@tez.incubator.apache.org
>> Subject: Re: Deadlock in DAGAppMaster during shutdown.
>> 
>> Hi,
>> 
>> I have build  the Tez jars from the git repository today; still, I see the
>> DAGAppMaster running even after the TezSession is stopped.
>> Do I need to get the code/jar from somewhere else to get the fix reflected?
>> 
>> 
>> On Tue, Jun 10, 2014 at 1:54 PM, Subroto Sanyal <sanyalsubroto@gmail.com>
>> wrote:
>> 
>> > Hi Oleg,
>> >
>> >
>> > Thanks for confirming. Could you please provide the TEZ jira tickets
>> > for both of the issue where they have been solved.
>> > I couldn't find the code changes for closing TezClient.
>> >
>> >
>> > On Tue, Jun 10, 2014 at 1:25 PM, Oleg Zhurakousky <
>> > ozhurakousky@hortonworks.com> wrote:
>> >
>> >> Subroto
>> >>
>> >> Thanks for pointing this out.
>> >> This and the TezClient issue you’ve pointed out in your previous
>> >> email is actually being actively addressed
>> >>
>> >> Oleg
>> >>
>> >> On Jun 10, 2014, at 5:42 AM, Subroto Sanyal <sanyalsubroto@gmail.com>
>> >> wrote:
>> >>
>> >> > In the class AMRMClientAsyncImpl the object(7c3041e28) is being
>> >> > locked
>> >> by
>> >> > Heartbeat thread(which kinds of run a infinite loop as any
>> >> > heartbeat
>> >> > thread) which is requested to be locked by the method
>> >> > unregisterApplicationMaster.
>> >> >
>> >> > Once the method unregisterApplicationMaster can lock the requested
>> >> object;
>> >> > then only it can notify the heartbeat thread to exit by a boolean
>> >> > flag keepRunning.
>> >> >
>> >> > Following is the thread-dump for the deadlock:
>> >> >
>> >> > "AMShutdownThread" daemon prio=5 tid=7f9a02921800 nid=0x115d68000
>> >> waiting
>> >> > for monitor entry [115d67000]
>> >> >
>> >> >   java.lang.Thread.State: BLOCKED (on object monitor)
>> >> >
>> >> > at
>> >> >
>> >> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl.unre
>> >> gisterApplicationMaster(AMRMClientAsyncImpl.java:156)
>> >> >
>> >> > - waiting to lock <7c3041e28> (a java.lang.Object)
>> >> >
>> >> > at
>> >> >
>> >> org.apache.tez.dag.app.rm.TaskScheduler.serviceStop(TaskScheduler.jav
>> >> a:394)
>> >> >
>> >> > - locked <7c3006aa0> (a org.apache.tez.dag.app.rm.TaskScheduler)
>> >> >
>> >> > at
>> >> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:2
>> >> 21)
>> >> >
>> >> > - locked <7c3038008> (a java.lang.Object)
>> >> >
>> >> > at
>> >> >
>> >> org.apache.tez.dag.app.rm.TaskSchedulerEventHandler.serviceStop(TaskS
>> >> chedulerEventHandler.java:357)
>> >> >
>> >> > at
>> >> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:2
>> >> 21)
>> >> >
>> >> > - locked <7c2f71360> (a java.lang.Object)
>> >> >
>> >> > at
>> >> >
>> >> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.ja
>> >> va:52)
>> >> >
>> >> > at
>> >> >
>> >> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperat
>> >> ions.java:80)
>> >> >
>> >> > at
>> >> org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:15
>> >> 18)
>> >> >
>> >> > at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:
>> >> 1649)
>> >> >
>> >> > - locked <7c2f51790> (a org.apache.tez.dag.app.DAGAppMaster)
>> >> >
>> >> > at
>> >> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:2
>> >> 21)
>> >> >
>> >> > - locked <7c2fed728> (a java.lang.Object)
>> >> >
>> >> > at
>> >> >
>> >> org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHandler$AMShu
>> >> tdownRunnable.run(DAGAppMaster.java:607)
>> >> >
>> >> > at java.lang.Thread.run(Thread.java:695)
>> >> >
>> >> >
>> >> > "AMRM Heartbeater thread" prio=5 tid=7f9a0c0e8800 nid=0x111e70000
>> >> waiting
>> >> > on condition [111e6f000]
>> >> >
>> >> >   java.lang.Thread.State: TIMED_WAITING (sleeping)
>> >> >
>> >> > at java.lang.Thread.sleep(Native Method)
>> >> >
>> >> > at
>> >> >
>> >> org.apache.hadoop.util.ThreadUtil.sleepAtLeastIgnoreInterrupts(Thread
>> >> Util.java:43)
>> >> >
>> >> > at
>> >> >
>> >> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocat
>> >> ionHandler.java:150)
>> >> >
>> >> > at com.sun.proxy.$Proxy9.allocate(Unknown Source)
>> >> >
>> >> > at
>> >> >
>> >> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMCl
>> >> ientImpl.java:246)
>> >> >
>> >> > at
>> >> >
>> >> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$Hear
>> >> tbeatThread.run(AMRMClientAsyncImpl.java:224)
>> >> >
>> >> > - locked <7c3041e28> (a java.lang.Object)
>> >> >
>> >> > *public void unregisterApplicationMaster(FinalApplicationStatus
>> >> appStatus,*
>> >> >
>> >> > *      String appMessage, String appTrackingUrl) throws YarnException,*
>> >> >
>> >> > *      IOException {*
>> >> >
>> >> > *    synchronized (unregisterHeartbeatLock) {*
>> >> >
>> >> > *      keepRunning = false;*
>> >> >
>> >> > *      client.unregisterApplicationMaster(appStatus, appMessage,
>> >> > appTrackingUrl);*
>> >> >
>> >> > *    }*
>> >> >
>> >> > *  }*
>> >> >
>> >> >
>> >> > The line "keepRunning = false" should be outside the synchronized
>> >> > block.
>> >> >
>> >> > I am not sure this should be regarded as problem in yarn or TEZ.
>> >> > The
>> >> flag
>> >> > is private and can't be accessed by Tez implementation
>> >> TezAMRMClientAsync.
>> >> >
>> >> >
>> >> > --
>> >> > Cheers,
>> >> > *Subroto Sanyal*
>> >>
>> >>
>> >> --
>> >> CONFIDENTIALITY NOTICE
>> >> NOTICE: This message is intended for the use of the individual or
>> >> entity to which it is addressed and may contain information that is
>> >> confidential, privileged and exempt from disclosure under applicable
>> >> law. If the reader of this message is not the intended recipient, you
>> >> are hereby notified that any printing, copying, dissemination,
>> >> distribution, disclosure or forwarding of this communication is
>> >> strictly prohibited. If you have received this communication in
>> >> error, please contact the sender immediately and delete it from your
>> >> system. Thank You.
>> >>
>> >
>> >
>> >
>> > --
>> > Cheers,
>> > *Subroto Sanyal*
>> >
>> 
>> 
>> 
>> --
>> Cheers,
>> *Subroto Sanyal*
>> 
>> --
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or entity to
>> which it is addressed and may contain information that is confidential,
>> privileged and exempt from disclosure under applicable law. If the reader
>> of this message is not the intended recipient, you are hereby notified that
>> any printing, copying, dissemination, distribution, disclosure or
>> forwarding of this communication is strictly prohibited. If you have
>> received this communication in error, please contact the sender immediately
>> and delete it from your system. Thank You.
>> 
>> 
>> 
>> -- 
>> Cheers,
>> Subroto Sanyal
>> <AMLogs.zip>
> 


-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message