tez-dev mailing list archives

From Hitesh Shah <hit...@apache.org>
Subject Re: Deadlock in DAGAppMaster during shutdown.
Date Tue, 10 Jun 2014 23:21:35 GMT
Hi Subroto, 

Could you file a jira for this with the output of jstack for the AM process and the AM logs?


thanks
— Hitesh

On Jun 10, 2014, at 3:26 PM, Subroto Sanyal <sanyalsubroto@gmail.com> wrote:

> Hi,
> 
> I built the Tez jars from the git repository today; still, I see the
> DAGAppMaster running even after the TezSession is stopped.
> Do I need to get the code/jar from somewhere else to get the fix reflected?
> 
> 
> On Tue, Jun 10, 2014 at 1:54 PM, Subroto Sanyal <sanyalsubroto@gmail.com>
> wrote:
> 
>> Hi Oleg,
>> 
>> 
>> Thanks for confirming. Could you please provide the TEZ JIRA tickets in
>> which both of the issues have been resolved?
>> I couldn't find the code changes for closing TezClient.
>> 
>> 
>> On Tue, Jun 10, 2014 at 1:25 PM, Oleg Zhurakousky <
>> ozhurakousky@hortonworks.com> wrote:
>> 
>>> Subroto
>>> 
>>> Thanks for pointing this out.
>>> This and the TezClient issue you pointed out in your previous email are
>>> actively being addressed.
>>> 
>>> Oleg
>>> 
>>> On Jun 10, 2014, at 5:42 AM, Subroto Sanyal <sanyalsubroto@gmail.com>
>>> wrote:
>>> 
>>>> In the class AMRMClientAsyncImpl, the object <7c3041e28> is locked by the
>>>> heartbeat thread (which, like any heartbeat thread, runs an effectively
>>>> infinite loop), and the method unregisterApplicationMaster is waiting to
>>>> acquire the same lock.
>>>> 
>>>> Only once unregisterApplicationMaster acquires that lock can it set the
>>>> boolean flag keepRunning to tell the heartbeat thread to exit.
>>>> 
>>>> Following is the thread-dump for the deadlock:
>>>> 
>>>> "AMShutdownThread" daemon prio=5 tid=7f9a02921800 nid=0x115d68000 waiting for monitor entry [115d67000]
>>>>    java.lang.Thread.State: BLOCKED (on object monitor)
>>>>         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl.unregisterApplicationMaster(AMRMClientAsyncImpl.java:156)
>>>>         - waiting to lock <7c3041e28> (a java.lang.Object)
>>>>         at org.apache.tez.dag.app.rm.TaskScheduler.serviceStop(TaskScheduler.java:394)
>>>>         - locked <7c3006aa0> (a org.apache.tez.dag.app.rm.TaskScheduler)
>>>>         at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>>>>         - locked <7c3038008> (a java.lang.Object)
>>>>         at org.apache.tez.dag.app.rm.TaskSchedulerEventHandler.serviceStop(TaskSchedulerEventHandler.java:357)
>>>>         at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>>>>         - locked <7c2f71360> (a java.lang.Object)
>>>>         at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>>>>         at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>>>>         at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1518)
>>>>         at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:1649)
>>>>         - locked <7c2f51790> (a org.apache.tez.dag.app.DAGAppMaster)
>>>>         at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>>>>         - locked <7c2fed728> (a java.lang.Object)
>>>>         at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHandler$AMShutdownRunnable.run(DAGAppMaster.java:607)
>>>>         at java.lang.Thread.run(Thread.java:695)
>>>> 
>>>> "AMRM Heartbeater thread" prio=5 tid=7f9a0c0e8800 nid=0x111e70000 waiting on condition [111e6f000]
>>>>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>>>>         at java.lang.Thread.sleep(Native Method)
>>>>         at org.apache.hadoop.util.ThreadUtil.sleepAtLeastIgnoreInterrupts(ThreadUtil.java:43)
>>>>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:150)
>>>>         at com.sun.proxy.$Proxy9.allocate(Unknown Source)
>>>>         at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:246)
>>>>         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
>>>>         - locked <7c3041e28> (a java.lang.Object)
>>>> 
>>>>   public void unregisterApplicationMaster(FinalApplicationStatus appStatus,
>>>>       String appMessage, String appTrackingUrl) throws YarnException,
>>>>       IOException {
>>>>     synchronized (unregisterHeartbeatLock) {
>>>>       keepRunning = false;
>>>>       client.unregisterApplicationMaster(appStatus, appMessage,
>>>>           appTrackingUrl);
>>>>     }
>>>>   }
>>>> 
>>>> 
>>>> The line "keepRunning = false" should be outside the synchronized block.
>>>> 
>>>> I am not sure whether this should be regarded as a problem in YARN or in
>>>> Tez. The flag is private and can't be accessed by the Tez implementation,
>>>> TezAMRMClientAsync.
>>>> 
>>>> 
>>>> --
>>>> Cheers,
>>>> *Subroto Sanyal*
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> --
>> Cheers,
>> *Subroto Sanyal*
>> 
> 
> 
> 
> -- 
> Cheers,
> *Subroto Sanyal*
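
The locking pattern discussed in the quoted analysis can be reproduced in isolation. The following is a minimal Java sketch, not the actual YARN or Tez code: the class HeartbeatShutdownSketch, its two unregister variants, and the helper flagClearedWhileLockHeld are all hypothetical names, and a latch stands in for a long, retrying allocate call. It only illustrates that a flag written inside the synchronized block cannot change while another thread holds the monitor, whereas a volatile flag written before the block is visible right away.

```java
import java.util.concurrent.CountDownLatch;

class HeartbeatShutdownSketch {
    private final Object unregisterHeartbeatLock = new Object();
    private volatile boolean keepRunning = true;

    // Shape of the quoted snippet: the flag is flipped only after the lock is acquired.
    void unregisterFlagInsideLock() {
        synchronized (unregisterHeartbeatLock) {
            keepRunning = false;
            // the real method would issue the unregister call here
        }
    }

    // Shape of the proposed change: flip the flag first, then take the lock.
    void unregisterFlagOutsideLock() {
        keepRunning = false;
        synchronized (unregisterHeartbeatLock) {
            // the real method would issue the unregister call here
        }
    }

    // Returns true if keepRunning was already false while the simulated
    // heartbeat thread was still holding the lock.
    static boolean flagClearedWhileLockHeld(boolean flagOutsideLock)
            throws InterruptedException {
        HeartbeatShutdownSketch s = new HeartbeatShutdownSketch();
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch finishAllocate = new CountDownLatch(1);

        Thread heartbeat = new Thread(() -> {
            synchronized (s.unregisterHeartbeatLock) {
                lockHeld.countDown();
                try {
                    // Stands in for an allocate call that keeps retrying.
                    finishAllocate.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "heartbeat");
        heartbeat.start();
        lockHeld.await();

        Runnable unregister = flagOutsideLock
                ? s::unregisterFlagOutsideLock
                : s::unregisterFlagInsideLock;
        Thread shutdown = new Thread(unregister, "shutdown");
        shutdown.start();

        // Wait until the shutdown thread is blocked on the monitor.
        while (shutdown.getState() != Thread.State.BLOCKED) {
            Thread.sleep(1);
        }
        boolean cleared = !s.keepRunning;

        // Let the simulated allocate call return so both threads can finish.
        finishAllocate.countDown();
        heartbeat.join();
        shutdown.join();
        return cleared;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("flag inside lock, cleared early:  "
                + flagClearedWhileLockHeld(false));
        System.out.println("flag outside lock, cleared early: "
                + flagClearedWhileLockHeld(true));
    }
}
```

In both variants the shutdown thread still blocks until the lock is released. Moving the flag only lets a loop that checks it notice the shutdown request sooner, so whether it resolves the hang seen here would depend on the allocate call eventually returning.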

