hadoop-mapreduce-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kamal Kc (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-6131) Integer overflow in RMContainerAllocator results in starvation of applications
Date Fri, 17 Oct 2014 23:49:33 GMT
Kamal Kc created MAPREDUCE-6131:

             Summary: Integer overflow in RMContainerAllocator results in starvation of applications
                 Key: MAPREDUCE-6131
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6131
             Project: Hadoop Map/Reduce
          Issue Type: Bug
            Reporter: Kamal Kc

When processing large datasets, Hadoop encounters a scenario where all
 containers run reduce tasks and no map tasks are scheduled. The 
application does not fail but rather remains in this state without making 
any forward progress. It then has to be manually terminated. 

This bug is due to integer overflow in scheduleReduces() of 
RMContainerAllocator. The variable netScheduledMapMem overflows for 
large data sizes, takes negative value, and results in a large 
finalReduceMemLimit and a large rampup value. In almost all cases, this 
large rampup value is greater than the total number of reduce tasks. 
Therefore, the AM tries to assign all reduce tasks. And if the total number 
of reduce tasks is greater than the total container slots, then all slots are 
taken up by reduce tasks, leaving none for maps. 

With 128MB block size and 2GB map container size, overflow occurs with 128 TB data size. An
example scenario for the reproduction is: 

- Input data size of 32TB, block size 128MB, Map container size = 10GB,
reduce container size = 10GB, #reducers = 50,  cluster mem capacity =  7 x 40GB, slowstart=0.0

Better resolution might be to change the variables used in 
RMContainerAllocator from int to long. A simpler fix instead would be to 
only change the local variables of scheduleReduces() to long data types. 
Patch is attached for 2.2.0. 

This message was sent by Atlassian JIRA

View raw message