tez-dev mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TEZ-3846) Tez session may not clean up on internal error
Date Thu, 28 Sep 2017 22:48:00 GMT
Sergey Shelukhin created TEZ-3846:
-------------------------------------

             Summary: Tez session may not clean up on internal error
                 Key: TEZ-3846
                 URL: https://issues.apache.org/jira/browse/TEZ-3846
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Sergey Shelukhin


Normally, Hive blindly reopens the Tez session on any error. I accidentally broke that behavior,
and while investigating I noticed a new error that occurs before the reopen: a session in which
a DAG has already failed still reports that it is running a DAG. It looks like the session should
either clean up after the failure or, if we assume an OOM cannot be cleaned up after, die completely.
{noformat}
2017-09-28T01:07:12,352  INFO [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] client.TezClient: Submitted dag to TezSession, sessionName=HIVE-35a0e5c9-ce27-4b27-824c-ce9bc0fe104d, applicationId=application_1506585924598_0001, dagId=dag_1506585924598_0001_53, dagName=SELECT count(1) FROM (
...
2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] SessionState: Status: Failed
2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] SessionState: Vertex failed, vertexName=Map 61, vertexId=vertex_1506585924598_0001_53_01, diagnostics=[Vertex vertex_1506585924598_0001_53_01 [Map 61] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: src initializer failed, vertex=vertex_1506585924598_0001_53_01 [Map 61], java.lang.OutOfMemoryError: GC overhead limit exceeded
2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] SessionState: Invalid event V_INTERNAL_ERROR on Vertex vertex_1506585924598_0001_53_00 [Map 60]
2017-09-28T01:07:25,787 DEBUG [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] log.PerfLogger: </PERFLOG method=TezRunDag start=1506586032352 end=1506586045787 duration=13435 from=org.apache.hadoop.hive.ql.exec.tez.monitoring.TezJobMonitor>
... [reuse]
2017-09-28T01:07:28,459  INFO [11108166-069e-43d7-9e21-25b9214d01a4 main] client.TezClient: Submitting dag to TezSession, sessionName=HIVE-35a0e5c9-ce27-4b27-824c-ce9bc0fe104d, applicationId=application_1506585924598_0001, dagName=insert overwrite table orc_ppd_staging s...s(Stage-1), callerContext={ context=HIVE, callerType=HIVE_QUERY_ID, callerId=hiveptest_20170928010728_58f19d98-85da-4fad-83a7-7bf3aa0252a7 }
2017-09-28T01:07:35,259  INFO [11108166-069e-43d7-9e21-25b9214d01a4 main] exec.Task: Dag submit failed due to App master already running a DAG
{noformat}
The session stays alive and keeps failing this way on each subsequent DAG submission.
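
For illustration only, here is a minimal sketch of the client-side "reopen on any error" behavior described above, which is what normally hides this problem. It does not use the real Tez or Hive classes: `DagSession`, `StuckSession`, and `ReopeningClient` are hypothetical stand-ins, and rethrowing after the reopen is an assumption about how the caller wants to see the failure.

```java
import java.util.function.Supplier;

// Illustrative sketch only; none of these types exist in Tez or Hive.
class ReopenSketch {

    /** Hypothetical stand-in for a session handle (not the TezClient API). */
    interface DagSession {
        void submit(String dagName) throws Exception;
        void close();
    }

    /**
     * Simulates the reported behavior: once one DAG fails, the session
     * rejects every later submission instead of cleaning up or dying.
     */
    static class StuckSession implements DagSession {
        private boolean stuck = false;

        @Override
        public void submit(String dagName) throws Exception {
            if (stuck) {
                throw new IllegalStateException("App master already running a DAG");
            }
            if (dagName.startsWith("fail")) {
                stuck = true; // the failed DAG is never cleaned up
                throw new Exception("Vertex failed");
            }
        }

        @Override
        public void close() {
            // Nothing to release in this simulation.
        }
    }

    /** Discards the session and opens a fresh one after any submit failure. */
    static class ReopeningClient {
        private final Supplier<DagSession> factory;
        private DagSession session;
        private int reopens = 0;

        ReopeningClient(Supplier<DagSession> factory) {
            this.factory = factory;
            this.session = factory.get();
        }

        void submit(String dagName) throws Exception {
            try {
                session.submit(dagName);
            } catch (Exception e) {
                try {
                    session.close();
                } catch (RuntimeException ignored) {
                    // Best effort: the old session may already be unusable.
                }
                session = factory.get(); // blind reopen, regardless of the cause
                reopens++;
                throw e; // assumption: the caller still sees the original failure
            }
        }

        int reopens() {
            return reopens;
        }
    }

    public static void main(String[] args) throws Exception {
        ReopeningClient client = new ReopeningClient(StuckSession::new);
        try {
            client.submit("fail-1");
        } catch (Exception e) {
            System.out.println("first submit failed: " + e.getMessage());
        }
        client.submit("ok-2"); // succeeds because the session was replaced
        System.out.println("reopens=" + client.reopens());
    }
}
```

Without the reopen step, the second submission in `main` would hit the "already running a DAG" rejection, which is the situation shown in the log above.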



