tez-dev mailing list archives

From "Hitesh Sharma (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TEZ-4011) Don't consider task attempt failed if container fails to launch and thus times out
Date Thu, 18 Oct 2018 19:53:00 GMT
Hitesh Sharma created TEZ-4011:
----------------------------------

             Summary: Don't consider task attempt failed if container fails to launch and thus times out
                 Key: TEZ-4011
                 URL: https://issues.apache.org/jira/browse/TEZ-4011
             Project: Apache Tez
          Issue Type: Improvement
            Reporter: Hitesh Sharma


If a container fails to start (it never heartbeats back to the AM during launch), the container
is considered timed out and the task attempt assigned to it is marked as failed. That failure
counts towards the failure count for the task. In some environments this is not desirable,
because such launch timeouts are highly probable there: the task itself never got a chance to
run, yet each timeout counts towards the max task attempts and could lead to the task failing.
If we instead configure a larger container heartbeat timeout, the job becomes slow, as it just
waits for the container to launch. An alternative is to kill the task attempt, rather than fail
it, when the container times out during launch. Killed attempts are not counted towards task
attempt failures, which allows a more aggressive launch timeout. This behavior would be off by
default, and users could opt into it if it makes more sense in their environments.
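
To make the difference concrete, here is a minimal, self-contained sketch of the accounting
described above. It is not Tez code: the class, the enums, and the killOnLaunchTimeout flag are
hypothetical names made up for illustration. It only assumes that failed attempts count towards
a per-task limit while killed attempts do not.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative model only; none of these names come from Tez.
class LaunchTimeoutSketch {
    enum Event { LAUNCH_TIMEOUT, TASK_ERROR, SUCCESS }
    enum Outcome { SUCCEEDED, FAILED, KILLED }

    // Maps what happened to an attempt onto its final state.
    // killOnLaunchTimeout is the hypothetical opt-in flag (off by default).
    static Outcome classify(Event event, boolean killOnLaunchTimeout) {
        switch (event) {
            case SUCCESS:
                return Outcome.SUCCEEDED;
            case LAUNCH_TIMEOUT:
                return killOnLaunchTimeout ? Outcome.KILLED : Outcome.FAILED;
            default:
                return Outcome.FAILED;
        }
    }

    // Replays the attempts of one task in order. Only FAILED attempts
    // count towards maxFailedAttempts; KILLED attempts are simply retried.
    static boolean taskSucceeds(List<Event> attempts, int maxFailedAttempts,
                                boolean killOnLaunchTimeout) {
        int failed = 0;
        for (Event event : attempts) {
            Outcome outcome = classify(event, killOnLaunchTimeout);
            if (outcome == Outcome.SUCCEEDED) {
                return true;
            }
            if (outcome == Outcome.FAILED && ++failed >= maxFailedAttempts) {
                return false;
            }
        }
        return false; // ran out of recorded attempts without a success
    }

    public static void main(String[] args) {
        // Four containers never launch, the fifth one runs the task fine.
        List<Event> attempts = Arrays.asList(
            Event.LAUNCH_TIMEOUT, Event.LAUNCH_TIMEOUT,
            Event.LAUNCH_TIMEOUT, Event.LAUNCH_TIMEOUT, Event.SUCCESS);

        System.out.println(taskSucceeds(attempts, 4, false)); // prints "false"
        System.out.println(taskSucceeds(attempts, 4, true));  // prints "true"
    }
}
```

With the flag off, the four launch timeouts exhaust the limit before the task ever runs. With
the flag on, they are treated as kills and the fifth attempt succeeds. Genuine task errors
still count as failures in both modes.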

 

Thoughts?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
