hadoop-mapreduce-dev mailing list archives

From "John Elliott (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-4947) Random task failures during TeraSort job
Date Fri, 18 Jan 2013 14:36:13 GMT
John Elliott created MAPREDUCE-4947:
---------------------------------------

             Summary: Random task failures during TeraSort job
                 Key: MAPREDUCE-4947
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4947
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 1.0.1, 1.0.0, 0.20.205.0
         Environment: RHEL 6.2
4 datanodes
    one xfs filesystem per datanode
    2 quad-core CPUs per datanode
    48 GB memory per datanode
10GbE node interconnect
jdk1.6.0_32
            Reporter: John Elliott
            Priority: Minor


During most of my terasort jobs, I see occasional, random map task failures during the reduce
phase.  Usually there will be only 1-4 task failures during a job, with the job completing
successfully.  On rare occasions, a tasktracker will be blacklisted.  Below are the usual
error messages:
========================================
INFO mapred.JobClient: Task Id : attempt_201301151521_0002_m_005954_0, Status : FAILED
java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 126.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
WARN mapred.JobClient: Error reading task outputhttp://datanode3:50060/tasklog?plaintext=true&attemptid=attempt_201301151521_0002_m_005954_0&filter=stdout
WARN mapred.JobClient: Error reading task outputhttp://datanode3:50060/tasklog?plaintext=true&attemptid=attempt_201301151521_0002_m_005954_0&filter=stderr
==========================================
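For reference, by POSIX shell convention an exit status of 126 means a command was found but could not be executed (for example, a script without execute permission). Whether that is what happens to the task launcher here is only an assumption. The sketch below just reproduces the convention with a throwaway script; the file name `launcher.sh` is hypothetical and has nothing to do with Hadoop's own files.

```python
import os
import shlex
import stat
import subprocess
import tempfile


def exit_status_of_non_executable_script():
    """Run a script that lacks the execute bit through sh and return the status."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for a task launcher script.
        script = os.path.join(tmp, "launcher.sh")
        with open(script, "w") as fh:
            fh.write("#!/bin/sh\nexit 0\n")
        # Read/write for the owner only: no execute permission for anyone.
        os.chmod(script, stat.S_IRUSR | stat.S_IWUSR)
        # The shell finds the file but cannot execute it.
        result = subprocess.run(
            ["sh", "-c", shlex.quote(script)],
            stderr=subprocess.DEVNULL,
        )
        return result.returncode


if __name__ == "__main__":
    print(exit_status_of_non_executable_script())
```

If the same status shows up when a tasktracker's launch script is run by hand as the task user, permissions or mount options (such as noexec) on the local directories would be worth checking.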
Tasktracker nodes are configured with 8 map and 7 reduce slots each, for a total of 32 map slots
and 28 reduce slots across the 4-datanode cluster.

The problem never occurs during teragen jobs and only occurs after reduce copies start.  Cutting
the number of slots in half reduces the frequency, but the problem still occurs.


Actions taken without any success:
ulimit increases for nproc and nofile to 32768 and then 65536
setting MALLOC_ARENA_MAX=4 in the hadoop-env.sh file per HADOOP-7154.
heapsize increases and reductions
reduction of map and reduce slots as stated above
various modifications of mapreduce and hdfs properties

I've done quite a bit of testing with CDH3 on the same hardware and not encountered this problem,
so I suspect there may be a bug fix or patch I'm missing.  Any suggestions for further isolating
the problem or application of patches would be much appreciated.
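To help isolate the problem, a small script can tally failed attempts per tasktracker host and per exit status from saved JobClient output. This is a minimal sketch, not part of Hadoop: the function name `summarize` and the regular expressions are my own assumptions, based only on the message format quoted earlier, and the sample log is abridged from those messages.

```python
import re
from collections import Counter

# Abridged sample in the format of the JobClient messages quoted earlier.
SAMPLE_LOG = """\
INFO mapred.JobClient: Task Id : attempt_201301151521_0002_m_005954_0, Status : FAILED
java.lang.Throwable: Child Error
Caused by: java.io.IOException: Task process exit with nonzero status of 126.
WARN mapred.JobClient: Error reading task outputhttp://datanode3:50060/tasklog?plaintext=true&attemptid=attempt_201301151521_0002_m_005954_0&filter=stdout
"""

FAILED_RE = re.compile(r"Task Id : (attempt_\d+_\d+_[mr]_\d+_\d+), Status : FAILED")
EXIT_RE = re.compile(r"exit with nonzero status of (\d+)")
TASKLOG_RE = re.compile(r"http://([^:/]+):\d+/tasklog\?.*attemptid=(attempt_\w+)")


def summarize(log_text):
    """Return (failures per host, occurrences per exit status)."""
    failed_attempts = []
    host_of_attempt = {}
    exit_statuses = Counter()
    for line in log_text.splitlines():
        match = FAILED_RE.search(line)
        if match:
            failed_attempts.append(match.group(1))
            continue
        match = EXIT_RE.search(line)
        if match:
            exit_statuses[int(match.group(1))] += 1
            continue
        match = TASKLOG_RE.search(line)
        if match:
            # The tasklog URL names the tasktracker host that ran the attempt.
            host_of_attempt[match.group(2)] = match.group(1)
    per_host = Counter(host_of_attempt.get(a, "unknown") for a in failed_attempts)
    return per_host, exit_statuses


if __name__ == "__main__":
    hosts, statuses = summarize(SAMPLE_LOG)
    print(dict(hosts), dict(statuses))
```

If the failures cluster on one host, that would point to a node-local cause rather than a job-wide one.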

Thanks in advance!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
