hadoop-mapreduce-dev mailing list archives

From "Robert Joseph Evans (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted
Date Mon, 05 Nov 2012 20:20:12 GMT
Robert Joseph Evans created MAPREDUCE-4772:

             Summary: Fetch failures can take way too long for a map to be restarted
                 Key: MAPREDUCE-4772
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2
    Affects Versions: 0.23.4
            Reporter: Robert Joseph Evans
            Assignee: Robert Joseph Evans
            Priority: Critical

In one particular case we saw an NM (NodeManager) go down at just the wrong moment: most of
the reducers had already fetched the output of the map tasks, but not all of them.

The reducers that failed to get the output reported to the AM (ApplicationMaster) rather
quickly that they could not fetch from the NM. But because the other reducers were still
running, the AM would not relaunch the map task: fewer than 50% of the running reducers had
reported fetch failures. Then, because of the exponential back-off for fetches on the
reducers, it took 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and
report in again. By that point the other reducers had finished, so the job relaunched the
map task. If the other reducers had still been running at the 1 hour 45 min mark, I have no
idea how long it would have taken for each of the tasks to get to 30 fetch failures.
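
The relaunch condition described above can be illustrated with a minimal sketch. This is not
the actual AM code: the function name `should_relaunch_map` and the reducer counts are
hypothetical, and only the "more than 50% of the running reducers" rule comes from the
description above.

```python
def should_relaunch_map(reducers_reporting_failure, running_reducers, threshold=0.5):
    """Hypothetical check: relaunch the map only when more than `threshold`
    of the currently running reducers have reported fetch failures."""
    if running_reducers == 0:
        return False
    return reducers_reporting_failure / running_reducers > threshold


# Hypothetical numbers: 100 reducers are running and 10 of them cannot fetch.
# 10 / 100 = 0.1, which is not above 0.5, so the map is not relaunched.
print(should_relaunch_map(10, 100))

# Once the 90 healthy reducers have finished, only the 10 failing ones are
# still running. 10 / 10 = 1.0, which is above 0.5, so the map is relaunched.
print(should_relaunch_map(10, 10))
```

With the denominator tied to running reducers, the failing minority cannot cross the threshold
until the healthy reducers leave the running set.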

We need to trigger the map relaunch based on the percentage of reducers that are shuffling,
not the percentage of reducers that are running. We also need a maximum limit on the
back-off, so that a reducer never waits for days to try to fetch map output.
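
A rough sketch of the two proposed changes, under stated assumptions: the function names, the
10 second base delay, the doubling factor, and the 300 second cap are all hypothetical and are
not taken from Hadoop. The sketch only shows the shape of the fix: a threshold measured against
shuffling reducers, and a back-off that stops growing at a ceiling.

```python
def should_relaunch_map_by_shuffling(reducers_reporting_failure, shuffling_reducers,
                                     threshold=0.5):
    """Hypothetical check: compare failure reports against the reducers that
    are still shuffling, rather than against all running reducers."""
    if shuffling_reducers == 0:
        return False
    return reducers_reporting_failure / shuffling_reducers > threshold


def capped_backoff(attempt, base_seconds=10.0, factor=2.0, max_seconds=300.0):
    """Exponential back-off delay for a 0-based retry attempt, never above
    max_seconds. All default values here are illustrative assumptions."""
    return min(base_seconds * factor ** attempt, max_seconds)


# Hypothetical numbers: 100 reducers are running, but only 12 are still in
# the shuffle phase, and 10 of those report fetch failures.
# 10 / 12 is above 0.5, so the map would be relaunched right away.
print(should_relaunch_map_by_shuffling(10, 12))

# Without a cap, attempt 20 would wait 10 * 2**20 seconds (roughly 121 days);
# with the cap, the delay stays at max_seconds.
print(capped_backoff(20))
```

Reducers that have already finished shuffling no longer dilute the failure percentage, and the
cap bounds the time between consecutive fetch attempts.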
