tez-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Oleg Zhurakousky <ozhurakou...@hortonworks.com>
Subject Re: Deadlock in DAGAppMaster during shutdown.
Date Thu, 12 Jun 2014 19:12:25 GMT
I was able to get to the bottom of it while working on semi-related issue https://issues.apache.org/jira/browse/TEZ-1191
and discovered a few more issues. So I’ve created a new umbrella issue  https://issues.apache.org/jira/browse/TEZ-1206
which you can track.
The good thing is that its being fixed and there will be tests for it.

Cheers
Oleg

On Jun 12, 2014, at 2:16 PM, Oleg Zhurakousky <ozhurakousky@hortonworks.com> wrote:

> Guys, I am getting to the bottom of it (slowly but surely)
> 
> Part of the issue is that a lot of interrupts are not propagated, which is one of the
reason for a deadlock.
> I’ll provide more info as i know.
> Oleg
> 
> On Jun 12, 2014, at 12:50 PM, Bikas Saha <bikas@hortonworks.com> wrote:
> 
>> Yes. TezSession.shutdown() does not block until the session is actually
>> finished. So this is a race condition. Did you manage to repro this and
>> verify the way I described earlier?
>> 
>> You can open a jira to make TezSession.shutdown() block optionally as a
>> sub-task of TEZ-690.
>> 
>> -----Original Message-----
>> From: Subroto Sanyal [mailto:sanyalsubroto@gmail.com]
>> Sent: Thursday, June 12, 2014 2:15 AM
>> To: dev@tez.incubator.apache.org
>> Subject: Re: Deadlock in DAGAppMaster during shutdown.
>> 
>> Hi Bikas,
>> 
>> I assume Step 6 should take care of closing the Tez AM (DAGAppMaster).
>> Please correct me if I am wrong.
>> My client code makes sure that first TezSession stop is invoked and then
>> only cluster is shut-down.
>> I think TezSession.stop should ensure that DAGAppMaster is down; which I
>> think it doesn't ensures.
>> 
>> Following is the Tear down code for the test(The code for TezSessionManager
>> is from PIG's TezSessionManager):
>> 
>> @Override
>> 
>>   protected void after() {
>> 
>>       // Order of shutdown is important.
>> 
>>       // If the order is reversed; the test-case will be stuck for-ever
>> 
>>       TezSessionManager.shutdown();
>> 
>>       if (_cluster != null) {
>> 
>>           _cluster.stop();
>> 
>>       }
>> 
>>   }
>> 
>> 
>> Hi Oleg,
>> 
>> Thanks Oleg for giving a try to reproduce the problem.
>> 
>> I would say the Thread dump clearly shows there is a kind of
>> dead-lock(fighting for same resource) between the AMShutdownThread and
>> AMRMHeartbeat Thread
>> 
>> 
>> On Wed, Jun 11, 2014 at 6:49 PM, Oleg Zhurakousky <
>> ozhurakousky@hortonworks.com> wrote:
>> 
>>> Also,
>>> 
>>> I was just able to reproduce it and here is the question.
>>> Are you sure that your job actually executes?
>>> Can you look at logs and see if you see something like this:
>>> 2014-06-11 12:31:46,034 INFO
>>> [ServiceThread:org.apache.tez.dag.app.rm.TaskSchedulerEventHandler]
>>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>>> 0.0.0.0/0.0.0.0:8035. Already tried 0 time(s); . . .
>>> Basically, if that’s what you have, DAGAppMaster can’t communicate
>>> with RM when it tries to register AM. In this case closing tea session
>>> does not affect DAGAppMaster and it becomes runaway process which is
>>> certainly a bug that needs to be addressed.
>>> Anyway, I have a different version of mini cluster I am using which is
>>> configurable and with that all runs fine.
>>> 
>>> Let me figure out a nice way to publish it.
>>> 
>>> Cheers
>>> Oleg
>>> 
>>> On Jun 11, 2014, at 11:48 AM, Oleg Zhurakousky <
>>> ozhurakousky@hortonworks.com> wrote:
>>> 
>>>> Subroto
>>>> 
>>>> Perhaps you can attach your client code.
>>>> I was just trying to reproduce the problem following your scenario
>>>> (as
>>> written) and DAGAppMaster shuts down successfully
>>>> 
>>>> Oleg
>>>> 
>>>> On Jun 11, 2014, at 10:05 AM, Oleg Zhurakousky <
>>> ozhurakousky@hortonworks.com> wrote:
>>>> 
>>>>> Subroto
>>>>> 
>>>>> Sorry for not getting back to you earlier I don’t believe there is
>>>>> a dedicated JIRA, but I’ve seen this and other
>>> similar problems and I’am working on it as we speak as part of a
>>> different issue.
>>>>> I’ll follow up before the end of today (EST), but feel free to
>>>>> raise
>>> JIRA
>>>>> 
>>>>> Oleg
>>>>> 
>>>>> On Jun 11, 2014, at 4:51 AM, Subroto Sanyal
>>>>> <sanyalsubroto@gmail.com>
>>> wrote:
>>>>> 
>>>>>> Hi Bikas, Hitesh
>>>>>> 
>>>>>> The Tezsession.stop() is invoked as part of my Client flow.
>>>>>> order of execution:
>>>>>> 1) Create MiniTezCluster
>>>>>> 2) Create Tez Session
>>>>>> 3) Create DAG
>>>>>> 4) Submit DAG to Tez Session and wait for completion
>>>>>> 5) Repeat step 4 for different DAGs
>>>>>> 6) Stop Tez Session
>>>>>> 7) Stop MiniTezCluster
>>>>>> 
>>>>>> PFA the container logs and thread-dump of DAGAppMaster
>>>>>> 
>>>>>> 
>>>>>> On Wed, Jun 11, 2014 at 1:23 AM, Bikas Saha
>>>>>> <bikas@hortonworks.com>
>>> wrote:
>>>>>> Can you please clarify TezSession is stopped? Has
>>>>>> TezSession.stop()
>>> been
>>>>>> called? If not then the session app on the cluster will not stop.
>>>>>> It
>>> will
>>>>>> stop after its been idle (no DAG running) for a configurable
>>>>>> timeout
>>> period.
>>>>>> 
>>>>>> If TezSession.stop() has been called then the AM might keep
>>>>>> running and clean up existing running tasks etc. Then exit when
>>>>>> this cleanup is
>>> done.
>>>>>> TezSession.stop() is not blocking on the client. So the method can
>>> return
>>>>>> before the app exits.
>>>>>> 
>>>>>> Bikas
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Subroto Sanyal [mailto:sanyalsubroto@gmail.com]
>>>>>> Sent: Tuesday, June 10, 2014 3:27 PM
>>>>>> To: dev@tez.incubator.apache.org
>>>>>> Subject: Re: Deadlock in DAGAppMaster during shutdown.
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I have build  the Tez jars from the git repository today; still,
I
>>>>>> see
>>> the
>>>>>> DAGAppMaster running even after the TezSession is stopped.
>>>>>> Do I need to get the code/jar from somewhere else to get the fix
>>> reflected?
>>>>>> 
>>>>>> 
>>>>>> On Tue, Jun 10, 2014 at 1:54 PM, Subroto Sanyal <
>>> sanyalsubroto@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi Oleg,
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks for confirming. Could you please provide the TEZ jira
>>>>>>> tickets for both of the issue where they have been solved.
>>>>>>> I couldn't find the code changes for closing TezClient.
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Jun 10, 2014 at 1:25 PM, Oleg Zhurakousky <
>>>>>>> ozhurakousky@hortonworks.com> wrote:
>>>>>>> 
>>>>>>>> Subroto
>>>>>>>> 
>>>>>>>> Thanks for pointing this out.
>>>>>>>> This and the TezClient issue you’ve pointed out in your
>>>>>>>> previous email is actually being actively addressed
>>>>>>>> 
>>>>>>>> Oleg
>>>>>>>> 
>>>>>>>> On Jun 10, 2014, at 5:42 AM, Subroto Sanyal <
>>> sanyalsubroto@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> In the class AMRMClientAsyncImpl the object(7c3041e28)
is
>>>>>>>>> being locked
>>>>>>>> by
>>>>>>>>> Heartbeat thread(which kinds of run a infinite loop as
any
>>>>>>>>> heartbeat
>>>>>>>>> thread) which is requested to be locked by the method
>>>>>>>>> unregisterApplicationMaster.
>>>>>>>>> 
>>>>>>>>> Once the method unregisterApplicationMaster can lock
the
>>>>>>>>> requested
>>>>>>>> object;
>>>>>>>>> then only it can notify the heartbeat thread to exit
by a
>>>>>>>>> boolean flag keepRunning.
>>>>>>>>> 
>>>>>>>>> Following is the thread-dump for the deadlock:
>>>>>>>>> 
>>>>>>>>> "AMShutdownThread" daemon prio=5 tid=7f9a02921800
>>>>>>>>> nid=0x115d68000
>>>>>>>> waiting
>>>>>>>>> for monitor entry [115d67000]
>>>>>>>>> 
>>>>>>>>> java.lang.Thread.State: BLOCKED (on object monitor)
>>>>>>>>> 
>>>>>>>>> at
>>>>>>>>> 
>>>>>>>> 
>>> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl.unre
>>>>>>>> gisterApplicationMaster(AMRMClientAsyncImpl.java:156)
>>>>>>>>> 
>>>>>>>>> - waiting to lock <7c3041e28> (a java.lang.Object)
>>>>>>>>> 
>>>>>>>>> at
>>>>>>>>> 
>>>>>>>> 
>>> org.apache.tez.dag.app.rm.TaskScheduler.serviceStop(TaskScheduler.jav
>>>>>>>> a:394)
>>>>>>>>> 
>>>>>>>>> - locked <7c3006aa0> (a
>>>>>>>>> org.apache.tez.dag.app.rm.TaskScheduler)
>>>>>>>>> 
>>>>>>>>> at
>>>>>>>> 
>>> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:2
>>>>>>>> 21)
>>>>>>>>> 
>>>>>>>>> - locked <7c3038008> (a java.lang.Object)
>>>>>>>>> 
>>>>>>>>> at
>>>>>>>>> 
>>>>>>>> 
>>> org.apache.tez.dag.app.rm.TaskSchedulerEventHandler.serviceStop(TaskS
>>>>>>>> chedulerEventHandler.java:357)
>>>>>>>>> 
>>>>>>>>> at
>>>>>>>> 
>>> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:2
>>>>>>>> 21)
>>>>>>>>> 
>>>>>>>>> - locked <7c2f71360> (a java.lang.Object)
>>>>>>>>> 
>>>>>>>>> at
>>>>>>>>> 
>>>>>>>> 
>>> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.ja
>>>>>>>> va:52)
>>>>>>>>> 
>>>>>>>>> at
>>>>>>>>> 
>>>>>>>> 
>>> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperat
>>>>>>>> ions.java:80)
>>>>>>>>> 
>>>>>>>>> at
>>>>>>>> 
>>> org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:15
>>>>>>>> 18)
>>>>>>>>> 
>>>>>>>>> at
>>> org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:
>>>>>>>> 1649)
>>>>>>>>> 
>>>>>>>>> - locked <7c2f51790> (a org.apache.tez.dag.app.DAGAppMaster)
>>>>>>>>> 
>>>>>>>>> at
>>>>>>>> 
>>> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:2
>>>>>>>> 21)
>>>>>>>>> 
>>>>>>>>> - locked <7c2fed728> (a java.lang.Object)
>>>>>>>>> 
>>>>>>>>> at
>>>>>>>>> 
>>>>>>>> 
>>> org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHandler$AMShu
>>>>>>>> tdownRunnable.run(DAGAppMaster.java:607)
>>>>>>>>> 
>>>>>>>>> at java.lang.Thread.run(Thread.java:695)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> "AMRM Heartbeater thread" prio=5 tid=7f9a0c0e8800
>>>>>>>>> nid=0x111e70000
>>>>>>>> waiting
>>>>>>>>> on condition [111e6f000]
>>>>>>>>> 
>>>>>>>>> java.lang.Thread.State: TIMED_WAITING (sleeping)
>>>>>>>>> 
>>>>>>>>> at java.lang.Thread.sleep(Native Method)
>>>>>>>>> 
>>>>>>>>> at
>>>>>>>>> 
>>>>>>>> 
>>> org.apache.hadoop.util.ThreadUtil.sleepAtLeastIgnoreInterrupts(Thread
>>>>>>>> Util.java:43)
>>>>>>>>> 
>>>>>>>>> at
>>>>>>>>> 
>>>>>>>> 
>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocat
>>>>>>>> ionHandler.java:150)
>>>>>>>>> 
>>>>>>>>> at com.sun.proxy.$Proxy9.allocate(Unknown Source)
>>>>>>>>> 
>>>>>>>>> at
>>>>>>>>> 
>>>>>>>> 
>>> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMCl
>>>>>>>> ientImpl.java:246)
>>>>>>>>> 
>>>>>>>>> at
>>>>>>>>> 
>>>>>>>> 
>>> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$Hear
>>>>>>>> tbeatThread.run(AMRMClientAsyncImpl.java:224)
>>>>>>>>> 
>>>>>>>>> - locked <7c3041e28> (a java.lang.Object)
>>>>>>>>> 
>>>>>>>>> *public void
>>>>>>>>> unregisterApplicationMaster(FinalApplicationStatus
>>>>>>>> appStatus,*
>>>>>>>>> 
>>>>>>>>> *      String appMessage, String appTrackingUrl) throws
>>> YarnException,*
>>>>>>>>> 
>>>>>>>>> *      IOException {*
>>>>>>>>> 
>>>>>>>>> *    synchronized (unregisterHeartbeatLock) {*
>>>>>>>>> 
>>>>>>>>> *      keepRunning = false;*
>>>>>>>>> 
>>>>>>>>> *      client.unregisterApplicationMaster(appStatus,
appMessage,
>>>>>>>>> appTrackingUrl);*
>>>>>>>>> 
>>>>>>>>> *    }*
>>>>>>>>> 
>>>>>>>>> *  }*
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> The line "keepRunning = false" should be outside the
>>>>>>>>> synchronized block.
>>>>>>>>> 
>>>>>>>>> I am not sure this should be regarded as problem in yarn
or TEZ.
>>>>>>>>> The
>>>>>>>> flag
>>>>>>>>> is private and can't be accessed by Tez implementation
>>>>>>>> TezAMRMClientAsync.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Cheers,
>>>>>>>>> *Subroto Sanyal*
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>>> NOTICE: This message is intended for the use of the individual
>>>>>>>> or entity to which it is addressed and may contain information
>>>>>>>> that is confidential, privileged and exempt from disclosure
>>>>>>>> under applicable law. If the reader of this message is not
the
>>>>>>>> intended recipient,
>>> you
>>>>>>>> are hereby notified that any printing, copying, dissemination,
>>>>>>>> distribution, disclosure or forwarding of this communication
is
>>>>>>>> strictly prohibited. If you have received this communication
in
>>>>>>>> error, please contact the sender immediately and delete it
from
>>>>>>>> your system. Thank You.
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Cheers,
>>>>>>> *Subroto Sanyal*
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Cheers,
>>>>>> *Subroto Sanyal*
>>>>>> 
>>>>>> --
>>>>>> CONFIDENTIALITY NOTICE
>>>>>> NOTICE: This message is intended for the use of the individual or
>>> entity to
>>>>>> which it is addressed and may contain information that is
>>>>>> confidential, privileged and exempt from disclosure under
>>>>>> applicable law. If the
>>> reader
>>>>>> of this message is not the intended recipient, you are hereby
>>>>>> notified
>>> that
>>>>>> any printing, copying, dissemination, distribution, disclosure or
>>>>>> forwarding of this communication is strictly prohibited. If you
>>>>>> have received this communication in error, please contact the
>>>>>> sender
>>> immediately
>>>>>> and delete it from your system. Thank You.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Cheers,
>>>>>> Subroto Sanyal
>>>>>> <AMLogs.zip>
>>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> CONFIDENTIALITY NOTICE
>>> NOTICE: This message is intended for the use of the individual or
>>> entity to which it is addressed and may contain information that is
>>> confidential, privileged and exempt from disclosure under applicable
>>> law. If the reader of this message is not the intended recipient, you
>>> are hereby notified that any printing, copying, dissemination,
>>> distribution, disclosure or forwarding of this communication is
>>> strictly prohibited. If you have received this communication in error,
>>> please contact the sender immediately and delete it from your system.
>>> Thank You.
>>> 
>> 
>> 
>> 
>> --
>> Cheers,
>> *Subroto Sanyal*
>> 
>> -- 
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or entity to 
>> which it is addressed and may contain information that is confidential, 
>> privileged and exempt from disclosure under applicable law. If the reader 
>> of this message is not the intended recipient, you are hereby notified that 
>> any printing, copying, dissemination, distribution, disclosure or 
>> forwarding of this communication is strictly prohibited. If you have 
>> received this communication in error, please contact the sender immediately 
>> and delete it from your system. Thank You.
> 


-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message