tez-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Subroto Sanyal <sanyalsubr...@gmail.com>
Subject Re: Deadlock in DAGAppMaster during shutdown.
Date Thu, 12 Jun 2014 09:14:55 GMT
Hi Bikas,

I assume Step 6 should take care of closing the Tez AM (DAGAppMaster).
Please correct me if I am wrong.
My client code makes sure that first TezSession stop is invoked and then
only cluster is shut-down.
I think TezSession.stop should ensure that DAGAppMaster is down; which I
think it doesn't ensures.

Following is the Tear down code for the test(The code for TezSessionManager
is from PIG's TezSessionManager):

@Override

    protected void after() {

        // Order of shutdown is important.

        // If the order is reversed; the test-case will be stuck for-ever

        TezSessionManager.shutdown();

        if (_cluster != null) {

            _cluster.stop();

        }

    }


Hi Oleg,

Thanks Oleg for giving a try to reproduce the problem.

I would say the Thread dump clearly shows there is a kind of
dead-lock(fighting for same resource) between the AMShutdownThread and
AMRMHeartbeat Thread


On Wed, Jun 11, 2014 at 6:49 PM, Oleg Zhurakousky <
ozhurakousky@hortonworks.com> wrote:

> Also,
>
> I was just able to reproduce it and here is the question.
> Are you sure that your job actually executes?
> Can you look at logs and see if you see something like this:
> 2014-06-11 12:31:46,034 INFO
> [ServiceThread:org.apache.tez.dag.app.rm.TaskSchedulerEventHandler]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8035. Already tried 0 time(s);
> . . .
> Basically, if that’s what you have, DAGAppMaster can’t communicate with RM
> when it tries to register AM. In this case closing tea session does not
> affect DAGAppMaster and it becomes runaway process which is certainly a bug
> that needs to be addressed.
> Anyway, I have a different version of mini cluster I am using which is
> configurable and with that all runs fine.
>
> Let me figure out a nice way to publish it.
>
> Cheers
> Oleg
>
> On Jun 11, 2014, at 11:48 AM, Oleg Zhurakousky <
> ozhurakousky@hortonworks.com> wrote:
>
> > Subroto
> >
> > Perhaps you can attach your client code.
> > I was just trying to reproduce the problem following your scenario (as
> written) and DAGAppMaster shuts down successfully
> >
> > Oleg
> >
> > On Jun 11, 2014, at 10:05 AM, Oleg Zhurakousky <
> ozhurakousky@hortonworks.com> wrote:
> >
> >> Subroto
> >>
> >> Sorry for not getting back to you earlier
> >> I don’t believe there is a dedicated JIRA, but I’ve seen this and other
> similar problems and I’am working on it as we speak as part of a different
> issue.
> >> I’ll follow up before the end of today (EST), but feel free to raise
> JIRA
> >>
> >> Oleg
> >>
> >> On Jun 11, 2014, at 4:51 AM, Subroto Sanyal <sanyalsubroto@gmail.com>
> wrote:
> >>
> >>> Hi Bikas, Hitesh
> >>>
> >>> The Tezsession.stop() is invoked as part of my Client flow.
> >>> order of execution:
> >>> 1) Create MiniTezCluster
> >>> 2) Create Tez Session
> >>> 3) Create DAG
> >>> 4) Submit DAG to Tez Session and wait for completion
> >>> 5) Repeat step 4 for different DAGs
> >>> 6) Stop Tez Session
> >>> 7) Stop MiniTezCluster
> >>>
> >>> PFA the container logs and thread-dump of DAGAppMaster
> >>>
> >>>
> >>> On Wed, Jun 11, 2014 at 1:23 AM, Bikas Saha <bikas@hortonworks.com>
> wrote:
> >>> Can you please clarify TezSession is stopped? Has TezSession.stop()
> been
> >>> called? If not then the session app on the cluster will not stop. It
> will
> >>> stop after its been idle (no DAG running) for a configurable timeout
> period.
> >>>
> >>> If TezSession.stop() has been called then the AM might keep running and
> >>> clean up existing running tasks etc. Then exit when this cleanup is
> done.
> >>> TezSession.stop() is not blocking on the client. So the method can
> return
> >>> before the app exits.
> >>>
> >>> Bikas
> >>>
> >>> -----Original Message-----
> >>> From: Subroto Sanyal [mailto:sanyalsubroto@gmail.com]
> >>> Sent: Tuesday, June 10, 2014 3:27 PM
> >>> To: dev@tez.incubator.apache.org
> >>> Subject: Re: Deadlock in DAGAppMaster during shutdown.
> >>>
> >>> Hi,
> >>>
> >>> I have build  the Tez jars from the git repository today; still, I see
> the
> >>> DAGAppMaster running even after the TezSession is stopped.
> >>> Do I need to get the code/jar from somewhere else to get the fix
> reflected?
> >>>
> >>>
> >>> On Tue, Jun 10, 2014 at 1:54 PM, Subroto Sanyal <
> sanyalsubroto@gmail.com>
> >>> wrote:
> >>>
> >>> > Hi Oleg,
> >>> >
> >>> >
> >>> > Thanks for confirming. Could you please provide the TEZ jira tickets
> >>> > for both of the issue where they have been solved.
> >>> > I couldn't find the code changes for closing TezClient.
> >>> >
> >>> >
> >>> > On Tue, Jun 10, 2014 at 1:25 PM, Oleg Zhurakousky <
> >>> > ozhurakousky@hortonworks.com> wrote:
> >>> >
> >>> >> Subroto
> >>> >>
> >>> >> Thanks for pointing this out.
> >>> >> This and the TezClient issue you’ve pointed out in your previous
> >>> >> email is actually being actively addressed
> >>> >>
> >>> >> Oleg
> >>> >>
> >>> >> On Jun 10, 2014, at 5:42 AM, Subroto Sanyal <
> sanyalsubroto@gmail.com>
> >>> >> wrote:
> >>> >>
> >>> >> > In the class AMRMClientAsyncImpl the object(7c3041e28) is
being
> >>> >> > locked
> >>> >> by
> >>> >> > Heartbeat thread(which kinds of run a infinite loop as any
> >>> >> > heartbeat
> >>> >> > thread) which is requested to be locked by the method
> >>> >> > unregisterApplicationMaster.
> >>> >> >
> >>> >> > Once the method unregisterApplicationMaster can lock the requested
> >>> >> object;
> >>> >> > then only it can notify the heartbeat thread to exit by a
boolean
> >>> >> > flag keepRunning.
> >>> >> >
> >>> >> > Following is the thread-dump for the deadlock:
> >>> >> >
> >>> >> > "AMShutdownThread" daemon prio=5 tid=7f9a02921800 nid=0x115d68000
> >>> >> waiting
> >>> >> > for monitor entry [115d67000]
> >>> >> >
> >>> >> >   java.lang.Thread.State: BLOCKED (on object monitor)
> >>> >> >
> >>> >> > at
> >>> >> >
> >>> >>
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl.unre
> >>> >> gisterApplicationMaster(AMRMClientAsyncImpl.java:156)
> >>> >> >
> >>> >> > - waiting to lock <7c3041e28> (a java.lang.Object)
> >>> >> >
> >>> >> > at
> >>> >> >
> >>> >>
> org.apache.tez.dag.app.rm.TaskScheduler.serviceStop(TaskScheduler.jav
> >>> >> a:394)
> >>> >> >
> >>> >> > - locked <7c3006aa0> (a org.apache.tez.dag.app.rm.TaskScheduler)
> >>> >> >
> >>> >> > at
> >>> >>
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:2
> >>> >> 21)
> >>> >> >
> >>> >> > - locked <7c3038008> (a java.lang.Object)
> >>> >> >
> >>> >> > at
> >>> >> >
> >>> >>
> org.apache.tez.dag.app.rm.TaskSchedulerEventHandler.serviceStop(TaskS
> >>> >> chedulerEventHandler.java:357)
> >>> >> >
> >>> >> > at
> >>> >>
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:2
> >>> >> 21)
> >>> >> >
> >>> >> > - locked <7c2f71360> (a java.lang.Object)
> >>> >> >
> >>> >> > at
> >>> >> >
> >>> >>
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.ja
> >>> >> va:52)
> >>> >> >
> >>> >> > at
> >>> >> >
> >>> >>
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperat
> >>> >> ions.java:80)
> >>> >> >
> >>> >> > at
> >>> >>
> org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:15
> >>> >> 18)
> >>> >> >
> >>> >> > at
> org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:
> >>> >> 1649)
> >>> >> >
> >>> >> > - locked <7c2f51790> (a org.apache.tez.dag.app.DAGAppMaster)
> >>> >> >
> >>> >> > at
> >>> >>
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:2
> >>> >> 21)
> >>> >> >
> >>> >> > - locked <7c2fed728> (a java.lang.Object)
> >>> >> >
> >>> >> > at
> >>> >> >
> >>> >>
> org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHandler$AMShu
> >>> >> tdownRunnable.run(DAGAppMaster.java:607)
> >>> >> >
> >>> >> > at java.lang.Thread.run(Thread.java:695)
> >>> >> >
> >>> >> >
> >>> >> > "AMRM Heartbeater thread" prio=5 tid=7f9a0c0e8800 nid=0x111e70000
> >>> >> waiting
> >>> >> > on condition [111e6f000]
> >>> >> >
> >>> >> >   java.lang.Thread.State: TIMED_WAITING (sleeping)
> >>> >> >
> >>> >> > at java.lang.Thread.sleep(Native Method)
> >>> >> >
> >>> >> > at
> >>> >> >
> >>> >>
> org.apache.hadoop.util.ThreadUtil.sleepAtLeastIgnoreInterrupts(Thread
> >>> >> Util.java:43)
> >>> >> >
> >>> >> > at
> >>> >> >
> >>> >>
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocat
> >>> >> ionHandler.java:150)
> >>> >> >
> >>> >> > at com.sun.proxy.$Proxy9.allocate(Unknown Source)
> >>> >> >
> >>> >> > at
> >>> >> >
> >>> >>
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMCl
> >>> >> ientImpl.java:246)
> >>> >> >
> >>> >> > at
> >>> >> >
> >>> >>
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$Hear
> >>> >> tbeatThread.run(AMRMClientAsyncImpl.java:224)
> >>> >> >
> >>> >> > - locked <7c3041e28> (a java.lang.Object)
> >>> >> >
> >>> >> > *public void unregisterApplicationMaster(FinalApplicationStatus
> >>> >> appStatus,*
> >>> >> >
> >>> >> > *      String appMessage, String appTrackingUrl) throws
> YarnException,*
> >>> >> >
> >>> >> > *      IOException {*
> >>> >> >
> >>> >> > *    synchronized (unregisterHeartbeatLock) {*
> >>> >> >
> >>> >> > *      keepRunning = false;*
> >>> >> >
> >>> >> > *      client.unregisterApplicationMaster(appStatus, appMessage,
> >>> >> > appTrackingUrl);*
> >>> >> >
> >>> >> > *    }*
> >>> >> >
> >>> >> > *  }*
> >>> >> >
> >>> >> >
> >>> >> > The line "keepRunning = false" should be outside the synchronized
> >>> >> > block.
> >>> >> >
> >>> >> > I am not sure this should be regarded as problem in yarn or
TEZ.
> >>> >> > The
> >>> >> flag
> >>> >> > is private and can't be accessed by Tez implementation
> >>> >> TezAMRMClientAsync.
> >>> >> >
> >>> >> >
> >>> >> > --
> >>> >> > Cheers,
> >>> >> > *Subroto Sanyal*
> >>> >>
> >>> >>
> >>> >> --
> >>> >> CONFIDENTIALITY NOTICE
> >>> >> NOTICE: This message is intended for the use of the individual
or
> >>> >> entity to which it is addressed and may contain information that
is
> >>> >> confidential, privileged and exempt from disclosure under applicable
> >>> >> law. If the reader of this message is not the intended recipient,
> you
> >>> >> are hereby notified that any printing, copying, dissemination,
> >>> >> distribution, disclosure or forwarding of this communication is
> >>> >> strictly prohibited. If you have received this communication in
> >>> >> error, please contact the sender immediately and delete it from
your
> >>> >> system. Thank You.
> >>> >>
> >>> >
> >>> >
> >>> >
> >>> > --
> >>> > Cheers,
> >>> > *Subroto Sanyal*
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Cheers,
> >>> *Subroto Sanyal*
> >>>
> >>> --
> >>> CONFIDENTIALITY NOTICE
> >>> NOTICE: This message is intended for the use of the individual or
> entity to
> >>> which it is addressed and may contain information that is confidential,
> >>> privileged and exempt from disclosure under applicable law. If the
> reader
> >>> of this message is not the intended recipient, you are hereby notified
> that
> >>> any printing, copying, dissemination, distribution, disclosure or
> >>> forwarding of this communication is strictly prohibited. If you have
> >>> received this communication in error, please contact the sender
> immediately
> >>> and delete it from your system. Thank You.
> >>>
> >>>
> >>>
> >>> --
> >>> Cheers,
> >>> Subroto Sanyal
> >>> <AMLogs.zip>
> >>
> >
>
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>



-- 
Cheers,
*Subroto Sanyal*

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message