hadoop-mapreduce-dev mailing list archives

From "Tao Jie (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-7110) Support delayed retry for MR task attempts
Date Mon, 11 Jun 2018 08:50:00 GMT
Tao Jie created MAPREDUCE-7110:
----------------------------------

             Summary: Support delayed retry for MR task attempts
                 Key: MAPREDUCE-7110
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7110
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
    Affects Versions: 3.1.0, 2.8.2
            Reporter: Tao Jie
            Assignee: Tao Jie


Today, when a map or reduce task attempt fails, it is retried until it succeeds, up to 4
attempts in total by default.
In our production cluster, datanodes may go offline for a while. If the 3 datanodes that hold
the replicas of a block read by a map task go offline at the same time, that map attempt
fails. However, with the current logic the AppMaster launches the retry attempts immediately,
and they will very likely fail again if those datanodes do not recover soon. As a
result, the whole job fails, even if it has already been running for several hours.
In such a situation, a delayed retry mechanism would help. For example, the
first retry could start immediately, the second could wait 10s, and the third could wait
longer.
This could be an opt-in setting, useful especially for long-running jobs, and it would not
change the current default behavior.
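To make the proposal concrete, here is a minimal sketch of how the wait before each retry
might be computed. It is not taken from any patch: the class name, the constants, the doubling
rule for "longer", and the upper bound are all assumptions chosen for illustration. Only the
"first retry immediately, second retry after 10s" part comes from the description above.

```java
import java.util.concurrent.TimeUnit;

/** Illustrative only: computes how long to wait before launching a retry attempt. */
public class DelayedRetrySketch {
    // Hypothetical values; the issue only states that the second retry waits 10s.
    static final long BASE_DELAY_MS = TimeUnit.SECONDS.toMillis(10);
    // Assumed upper bound so that late retries do not wait indefinitely.
    static final long MAX_DELAY_MS = TimeUnit.MINUTES.toMillis(5);

    /**
     * @param retryNumber 1 for the first retry, 2 for the second, and so on
     * @return delay in milliseconds before that retry is launched
     */
    static long delayBeforeRetryMs(int retryNumber) {
        if (retryNumber <= 1) {
            return 0L; // first retry starts immediately
        }
        // Assumption: "longer" is modelled as doubling the delay on each later retry.
        int doublings = Math.min(retryNumber - 2, 20); // bound the shift to avoid overflow
        long delay = BASE_DELAY_MS << doublings;
        return Math.min(delay, MAX_DELAY_MS);
    }

    public static void main(String[] args) {
        for (int retry = 1; retry <= 3; retry++) {
            System.out.println("retry " + retry + " waits "
                + delayBeforeRetryMs(retry) + " ms");
        }
    }
}
```

With these assumed values, the three retries allowed by the default limit would wait 0s, 10s,
and 20s. A real implementation would presumably read the base delay from job configuration
and leave it disabled unless explicitly set.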
Does this make sense? Any thoughts?




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


