[ https://issues.apache.org/jira/browse/TEZ-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Eagles resolved TEZ-3549.
----------------------------------
Resolution: Fixed
Thanks, [~kshukla].
> TaskAttemptImpl does not initialize TEZ_TASK_PROGRESS_STUCK_INTERVAL_MS correctly
> ---------------------------------------------------------------------------------
>
> Key: TEZ-3549
> URL: https://issues.apache.org/jira/browse/TEZ-3549
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.1
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Fix For: 0.7.2, 0.9.0, 0.8.5
>
> Attachments: TEZ-3549-branch-0.7.001.patch, TEZ-3549.001.patch, TEZ-3549.002.patch, TEZ-3549.003.patch
>
>
> {code}
> hadoop jar /home/gs/tez/current/tez-examples-x.x.x.jar orderedwordcount -Dtez.task.progress.stuck.interval-ms=700000 /input /output
> {code}
> With {{tez.task.am.heartbeat.interval-ms.max=3000}} and {{timeout-ms=300000}}, the job above
> fails as follows:
> {code}
> INFO client.DAGClientImpl: DAG completed. FinalState=FAILED
> INFO examples.OrderedWordCount: DAG diagnostics: [Vertex failed, vertexName=Tokenizer,
vertexId=vertex_123_456_1_00, diagnostics=[Task failed, taskId=task_123_456_1_00_000007, diagnostics=[TaskAttempt
0 failed, info=[Attempt failed because it appears to make no progress for 700000ms], TaskAttempt
1 failed, info=[Attempt failed because it appears to make no progress for 700000ms], TaskAttempt
2 failed, info=[Attempt failed because it appears to make no progress for 700000ms], TaskAttempt
3 failed, info=[Attempt failed because it appears to make no progress for 700000ms]], Vertex
did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:51, Vertex vertex_123_456_1_00
[Tokenizer] killed/failed due to:OWN_TASK_FAILURE], Vertex killed, vertexName=Sorter, vertexId=vertex_123_456_1_02,
diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE,
failedTasks:0 killedTasks:1, Vertex vertex_123_456_1_02 [Sorter] killed/failed due to:OTHER_VERTEX_FAILURE],
Vertex killed, vertexName=Summation, vertexId=vertex_123_456_1_01, diagnostics=[Vertex received
Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0
killedTasks:1, Vertex vertex_123_456_1_01 [Summation] killed/failed due to:OTHER_VERTEX_FAILURE],
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:2]
> {code}
> This happens because, in TaskAttemptImpl, {{lastNotifyProgressTimestamp}} is 0 until progress
> is first notified, so the elapsed time is measured against 0 and the following check always
> evaluates to true whenever this config is set to a non-zero value.
> TaskUpdaterTransition:
> {code}
> if (statusEvent.getProgressNotified()) {
>   ta.lastNotifyProgressTimestamp = ta.clock.getTime();
> } else {
>   long currTime = ta.clock.getTime();
>   if (ta.hungIntervalMax > 0 &&
>       currTime - ta.lastNotifyProgressTimestamp > ta.hungIntervalMax) {
> {code}
> Ideally, {{lastNotifyProgressTimestamp}} should be initialized to a valid timestamp from the
> onset so that this comparison is valid.
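
The behavior described in the quoted report can be reproduced in isolation with a small model of the check. The following is only a sketch under stated assumptions, not the actual TaskAttemptImpl code or the committed patch: the class ProgressStuckSketch, the nested Clock interface, the Attempt class, and the seedTimestamp flag are hypothetical names introduced for illustration. Only the field names lastNotifyProgressTimestamp and hungIntervalMax and the shape of the comparison come from the excerpt above.

```java
// Hypothetical, simplified model of the stuck-progress check.
// Not the real TaskAttemptImpl; names other than the two fields are assumptions.
class ProgressStuckSketch {

    /** Minimal clock abstraction standing in for ta.clock (assumption). */
    interface Clock {
        long getTime();
    }

    static class Attempt {
        final Clock clock;
        final long hungIntervalMax;
        long lastNotifyProgressTimestamp;

        /**
         * @param seedTimestamp if false, the timestamp starts at 0 (the reported bug);
         *                      if true, it starts at the current clock time (one possible fix).
         */
        Attempt(Clock clock, long hungIntervalMax, boolean seedTimestamp) {
            this.clock = clock;
            this.hungIntervalMax = hungIntervalMax;
            this.lastNotifyProgressTimestamp = seedTimestamp ? clock.getTime() : 0L;
        }

        /** Returns true if this status update would mark the attempt as stuck. */
        boolean onStatusUpdate(boolean progressNotified) {
            if (progressNotified) {
                lastNotifyProgressTimestamp = clock.getTime();
                return false;
            }
            long currTime = clock.getTime();
            // Same comparison as in the excerpt: with a timestamp of 0 this
            // compares the raw clock value against the configured interval.
            return hungIntervalMax > 0
                && currTime - lastNotifyProgressTimestamp > hungIntervalMax;
        }
    }

    public static void main(String[] args) {
        Clock wallClock = System::currentTimeMillis;
        long interval = 700000L; // value used in the report

        Attempt unseeded = new Attempt(wallClock, interval, false);
        Attempt seeded = new Attempt(wallClock, interval, true);

        System.out.println("unseeded stuck on first update: " + unseeded.onStatusUpdate(false));
        System.out.println("seeded stuck on first update:   " + seeded.onStatusUpdate(false));
    }
}
```

With a wall clock that returns milliseconds since the epoch, the unseeded attempt is flagged on its very first status update that carries no progress notification, while the seeded attempt is not flagged until the configured interval has actually elapsed.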
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)