tez-dev mailing list archives

From Subroto Sanyal <sanyalsubr...@gmail.com>
Subject Re: Deadlock in DAGAppMaster during shutdown.
Date Tue, 10 Jun 2014 22:26:54 GMT
Hi,

I built the Tez jars from the git repository today; still, I see the
DAGAppMaster running even after the TezSession is stopped.
Do I need to get the code/jars from somewhere else for the fix to be reflected?
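
For reference, the change suggested in the quoted message below (setting keepRunning before entering the synchronized block) can be sketched roughly as follows. This is only an illustration under assumptions: HeartbeatShutdownSketch, callMillis and awaitHeartbeatExit are hypothetical names, Thread.sleep stands in for the allocate() call, and the flag is assumed to be volatile. It is not the actual AMRMClientAsyncImpl code.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical stand-in for the heartbeat/unregister pattern; not the YARN class. */
public class HeartbeatShutdownSketch {
    private final Object unregisterHeartbeatLock = new Object();
    // Assumption: volatile, so a write made without the lock is visible to the heartbeat thread.
    private volatile boolean keepRunning = true;
    private final CountDownLatch firstBeat = new CountDownLatch(1);
    private final long callMillis;
    private final Thread heartbeat;

    public HeartbeatShutdownSketch(long callMillis) {
        this.callMillis = callMillis;
        this.heartbeat = new Thread(this::heartbeatLoop, "heartbeat-sketch");
        this.heartbeat.setDaemon(true);
    }

    private void heartbeatLoop() {
        while (keepRunning) {
            synchronized (unregisterHeartbeatLock) {
                firstBeat.countDown();
                try {
                    // Stands in for a slow allocate() call made while holding the lock.
                    Thread.sleep(callMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }

    /** Starts the heartbeat and waits until it holds the lock for the first time. */
    public void start() {
        heartbeat.start();
        try {
            firstBeat.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public void unregister() {
        // Cleared before requesting the lock, as suggested in the quoted message.
        keepRunning = false;
        synchronized (unregisterHeartbeatLock) {
            // Stands in for client.unregisterApplicationMaster(...).
        }
    }

    /** Waits up to timeoutMillis and reports whether the heartbeat thread has exited. */
    public boolean awaitHeartbeatExit(long timeoutMillis) {
        try {
            heartbeat.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !heartbeat.isAlive();
    }

    public static void main(String[] args) {
        HeartbeatShutdownSketch sketch = new HeartbeatShutdownSketch(50);
        sketch.start();
        sketch.unregister();
        System.out.println("heartbeat exited: " + sketch.awaitHeartbeatExit(2000));
    }
}
```

With the flag cleared before the lock is requested, the heartbeat loop stops after its current call returns, and unregister() proceeds once the lock is released. If the call made under the lock never returns, unregister() still blocks, so this change alone would not help in that case.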


On Tue, Jun 10, 2014 at 1:54 PM, Subroto Sanyal <sanyalsubroto@gmail.com>
wrote:

> Hi Oleg,
>
>
> Thanks for confirming. Could you please provide the TEZ JIRA tickets in
> which both issues were resolved?
> I couldn't find the code changes for closing TezClient.
>
>
> On Tue, Jun 10, 2014 at 1:25 PM, Oleg Zhurakousky <
> ozhurakousky@hortonworks.com> wrote:
>
>> Subroto
>>
>> Thanks for pointing this out.
>> This and the TezClient issue you pointed out in your previous email are
>> actively being addressed.
>>
>> Oleg
>>
>> On Jun 10, 2014, at 5:42 AM, Subroto Sanyal <sanyalsubroto@gmail.com>
>> wrote:
>>
>> > In the class AMRMClientAsyncImpl, the object <7c3041e28> is locked by the
>> > heartbeat thread (which runs a more or less infinite loop, like any
>> > heartbeat thread), and the method unregisterApplicationMaster is waiting
>> > to lock the same object.
>> >
>> > Only once unregisterApplicationMaster acquires that lock can it signal the
>> > heartbeat thread to exit, by clearing the boolean flag keepRunning.
>> >
>> > Following is the thread-dump for the deadlock:
>> >
>> > "AMShutdownThread" daemon prio=5 tid=7f9a02921800 nid=0x115d68000 waiting for monitor entry [115d67000]
>> >   java.lang.Thread.State: BLOCKED (on object monitor)
>> >     at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl.unregisterApplicationMaster(AMRMClientAsyncImpl.java:156)
>> >     - waiting to lock <7c3041e28> (a java.lang.Object)
>> >     at org.apache.tez.dag.app.rm.TaskScheduler.serviceStop(TaskScheduler.java:394)
>> >     - locked <7c3006aa0> (a org.apache.tez.dag.app.rm.TaskScheduler)
>> >     at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>> >     - locked <7c3038008> (a java.lang.Object)
>> >     at org.apache.tez.dag.app.rm.TaskSchedulerEventHandler.serviceStop(TaskSchedulerEventHandler.java:357)
>> >     at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>> >     - locked <7c2f71360> (a java.lang.Object)
>> >     at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>> >     at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>> >     at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1518)
>> >     at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:1649)
>> >     - locked <7c2f51790> (a org.apache.tez.dag.app.DAGAppMaster)
>> >     at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>> >     - locked <7c2fed728> (a java.lang.Object)
>> >     at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHandler$AMShutdownRunnable.run(DAGAppMaster.java:607)
>> >     at java.lang.Thread.run(Thread.java:695)
>> >
>> >
>> > "AMRM Heartbeater thread" prio=5 tid=7f9a0c0e8800 nid=0x111e70000 waiting on condition [111e6f000]
>> >   java.lang.Thread.State: TIMED_WAITING (sleeping)
>> >     at java.lang.Thread.sleep(Native Method)
>> >     at org.apache.hadoop.util.ThreadUtil.sleepAtLeastIgnoreInterrupts(ThreadUtil.java:43)
>> >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:150)
>> >     at com.sun.proxy.$Proxy9.allocate(Unknown Source)
>> >     at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:246)
>> >     at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
>> >     - locked <7c3041e28> (a java.lang.Object)
>> >
>> >   public void unregisterApplicationMaster(FinalApplicationStatus appStatus,
>> >       String appMessage, String appTrackingUrl) throws YarnException,
>> >       IOException {
>> >     synchronized (unregisterHeartbeatLock) {
>> >       keepRunning = false;
>> >       client.unregisterApplicationMaster(appStatus, appMessage,
>> >           appTrackingUrl);
>> >     }
>> >   }
>> >
>> >
>> > The line "keepRunning = false" should be outside the synchronized block.
>> >
>> > I am not sure whether this should be regarded as a problem in YARN or in
>> > Tez. The flag is private and can't be accessed by the Tez implementation,
>> > TezAMRMClientAsync.
>> >
>> >
>> > --
>> > Cheers,
>> > *Subroto Sanyal*
>>
>>
>>
>
>
>
> --
> Cheers,
> *Subroto Sanyal*
>



-- 
Cheers,
*Subroto Sanyal*
