tez-dev mailing list archives

From "Ganesha Shreedhara (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TEZ-4063) DAGClient:tryKillDAG taking long time
Date Fri, 19 Apr 2019 06:54:00 GMT
Ganesha Shreedhara created TEZ-4063:

             Summary: DAGClient:tryKillDAG taking long time
                 Key: TEZ-4063
                 URL: https://issues.apache.org/jira/browse/TEZ-4063
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Ganesha Shreedhara

Hive uses DAGClient:tryKillDAG() to kill a Tez application. The kill takes a long time when
many tasks are being processed, because the kill event is appended to the eventQueue and is
handled only after the events already queued ahead of it.

I have a job with ~3L (~300,000) mappers, ~5K reducers and ~1000 tasks running in parallel.

When the Hive query is killed while this job is running, it takes ~6 mins for the tasks to
start getting killed: ~3 mins for the kill event to get from the AM to the DAG, and another
~3 mins for the kill event to get from the DAG to the vertex.
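The queueing effect described above can be reproduced with a toy model. This is only a sketch of a generic single-threaded FIFO dispatcher, not Tez code: the class name, the event names and the per-event cost passed to estimatedDelayMillis are hypothetical and are not measurements from this job.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model of a single-threaded FIFO dispatcher; hypothetical, not Tez code. */
final class FifoKillDelaySketch {

    /** Returns how many events are handled before the kill event is reached. */
    static int eventsHandledBeforeKill(int backlog) {
        Queue<String> eventQueue = new ArrayDeque<>();
        for (int i = 0; i < backlog; i++) {
            eventQueue.add("TASK_EVENT");   // events already waiting in the queue
        }
        eventQueue.add("DAG_KILL");         // the kill request joins at the tail

        int handled = 0;
        while (!eventQueue.isEmpty()) {
            String event = eventQueue.poll();
            if (event.equals("DAG_KILL")) {
                return handled;             // kill is seen only after the backlog
            }
            handled++;
        }
        throw new IllegalStateException("kill event was never enqueued");
    }

    /** Estimated wait in ms, given an assumed average cost per queued event. */
    static long estimatedDelayMillis(int backlog, double millisPerEvent) {
        return Math.round(eventsHandledBeforeKill(backlog) * millisPerEvent);
    }

    public static void main(String[] args) {
        // Illustrative inputs only: a large backlog and an assumed per-event cost.
        System.out.println(eventsHandledBeforeKill(300_000));
        System.out.println(estimatedDelayMillis(300_000, 0.6));
    }
}
```

In this model the wait grows linearly with the number of events queued ahead of the kill, which is consistent with the delay showing up once per queue hop (AM to DAG, then DAG to vertex).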


Below are the relevant log lines:
2019-04-10 15:11:35,776 [INFO] [IPC Server handler 0 on 44129] |app.DAGAppMaster|: Sending a kill event to the current DAG, dagId=dag_1554789825317_0535_1
2019-04-10 15:11:35,785 [INFO] [IPC Server handler 0 on 44129] |history.HistoryEventHandler|: [HISTORY][DAG:dag_1554789825317_0535_1][Event:DAG_KILL_REQUEST]: org.apache.tez.dag.history.events.DAGKillRequestEvent@731f79f4
~ 3 mins of delay
2019-04-10 15:14:34,171 [INFO] [Dispatcher thread {Central}] |impl.DAGImpl|: Dag received
~ 3 mins of delay
2019-04-10 15:17:52,434 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Killing tasks in vertex: vertex_1554789825317_0535_1_01 [Reducer 2] due to trigger: DAG_TERMINATED
2019-04-10 15:17:52,439 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Killing tasks in vertex: vertex_1554789825317_0535_1_00 [Map 1] due to trigger: DAG_TERMINATED

Pig uses the TezClient:stop() method, which kills the application asynchronously. It also
uses the tez.client.timeout-ms configuration, which can be set so that the YARN application
is killed once the client has waited longer than that threshold.
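The general pattern behind that behaviour (ask for a graceful stop, then fall back to a hard kill once a timeout expires) might look roughly like the sketch below. It is not the TezClient implementation: stopOrKill, its parameters and the class name are hypothetical, and the timeout argument merely stands in for a value such as tez.client.timeout-ms.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Generic stop-with-timeout pattern; hypothetical, not the TezClient code. */
final class StopWithTimeoutSketch {

    /**
     * Waits up to timeoutMs for the graceful stop to finish.
     * Returns true if it finished in time, false if forceKill was run instead.
     */
    static boolean stopOrKill(Future<?> gracefulStop, Runnable forceKill, long timeoutMs) {
        try {
            gracefulStop.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            forceKill.run();                      // graceful path too slow or failed
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt status
            forceKill.run();
            return false;
        }
    }
}
```

The point of the pattern is that the caller's wait is bounded by the timeout rather than by how long the application takes to react to the stop request.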


Is it expected behaviour for the kill event to be added to the eventQueue and processed
synchronously when DAGClient:tryKillDAG() is called?

Can we process the kill event immediately (maybe when a configuration is enabled) if the
user doesn't want the earlier events to be processed?
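One way to picture this proposal is a dispatcher whose queue orders events by priority, so that a kill event is served ahead of the backlog when a switch is on. The sketch below is only an illustration under that assumption: the class, the killFirst flag and the event names are hypothetical, and it does not reflect how the Tez dispatcher is actually written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Toy dispatcher with an optional fast lane for kill events; not Tez code. */
final class PriorityKillDispatcherSketch {

    /** Queue entry: lower priority value is served first, then arrival order. */
    private static final class Entry implements Comparable<Entry> {
        final String type;
        final int priority;
        final long seq;

        Entry(String type, int priority, long seq) {
            this.type = type;
            this.priority = priority;
            this.seq = seq;
        }

        @Override
        public int compareTo(Entry other) {
            int byPriority = Integer.compare(priority, other.priority);
            return byPriority != 0 ? byPriority : Long.compare(seq, other.seq);
        }
    }

    private final PriorityBlockingQueue<Entry> queue = new PriorityBlockingQueue<>();
    private final AtomicLong nextSeq = new AtomicLong();
    private final boolean killFirst;   // hypothetical configuration switch

    PriorityKillDispatcherSketch(boolean killFirst) {
        this.killFirst = killFirst;
    }

    /** Adds an event; kill events jump the queue only when killFirst is on. */
    void enqueue(String type) {
        boolean urgent = killFirst && type.equals("DAG_KILL");
        queue.add(new Entry(type, urgent ? 0 : 1, nextSeq.getAndIncrement()));
    }

    /** Drains the queue and returns the event types in the order handled. */
    List<String> drain() {
        List<String> handled = new ArrayList<>();
        Entry entry;
        while ((entry = queue.poll()) != null) {
            handled.add(entry.type);
        }
        return handled;
    }
}
```

With killFirst disabled the sketch behaves like a plain FIFO queue, so existing ordering is preserved by default. It still handles the remaining backlog after the kill; whether those earlier events should be discarded instead is a separate decision that the sketch leaves open.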


