hadoop-mapreduce-dev mailing list archives

From "Joshua Caplan (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-4796) Corrupt map outputs created via native Snappy compression
Date Wed, 14 Nov 2012 01:52:12 GMT
Joshua Caplan created MAPREDUCE-4796:

             Summary: Corrupt map outputs created via native Snappy compression
                 Key: MAPREDUCE-4796
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4796
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: task, tasktracker
    Affects Versions: 1.0.3
         Environment: Amazon Elastic MapReduce, m1.xlarge (Debian/Squeeze, amd64)
            Reporter: Joshua Caplan

I am observing cases where a single host in a cluster of 150 slaves "goes bad" with respect to Snappy.

Many, but not all, of its map-phase tasks produce the buggy exception message
"java.lang.ClassNotFoundException: Ljava.lang.InternalError" (see HADOOP-8151) during on-disk
merging. A smattering of reducer tasks across the cluster then report the same message on every
attempt during the "reduce > reduce" phase, and the job fails unless someone intervenes manually.
If I log into the rogue host and kill its tasktracker process while the job is still running,
Hadoop's self-healing (rescheduling the map tasks from the dead tasktracker) seems to fix the next
attempt of each formerly doomed reducer task, and the job succeeds. Subsequent jobs on the same
cluster occasionally show a different message on that same bad host:
"org.apache.hadoop.fs.ChecksumException: Checksum Error".

This evidence leads me to believe that some of the intermediate map output was corrupted by
the file system, but that the corruption was caught only when the corrupt write happened during
merging, and went undetected when the final write was the corrupt one.

The strategy for aggressively detecting shuffle failures via exception regex matching (MAPREDUCE-2529)
might be useful as a way to solve this case as well; if a tasktracker process could shut itself
down once it had detected this issue often enough, we would have no reason to intervene manually.
Unfortunately, I'm only seeing this message show up after the shuffle phase is finished;
we would need to scan for this exception during the map phase.
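
The idea above can be sketched roughly as follows. This is an illustration only, not the
MAPREDUCE-2529 implementation and not Hadoop code: the class name CorruptionMonitor, the
threshold, and the time window are all hypothetical, and the two patterns are simply the
messages quoted in this report. A real version would presumably live in the tasktracker (Java)
and would need to see map-phase task failures, not only shuffle failures.

```python
import re
from collections import deque

# Patterns taken from the two messages quoted in this report.
# \s* tolerates a line break or space after the colon.
CORRUPTION_PATTERNS = [
    re.compile(r"java\.lang\.ClassNotFoundException:\s*Ljava\.lang\.InternalError"),
    re.compile(r"org\.apache\.hadoop\.fs\.ChecksumException"),
]


class CorruptionMonitor:
    """Hypothetical per-host counter of suspicious task failure messages."""

    def __init__(self, threshold=3, window_seconds=600.0):
        # Both defaults are arbitrary assumptions, not values from Hadoop.
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._hits = deque()  # timestamps of matching failures, oldest first

    def observe(self, message, now):
        """Record one failure message seen at time `now` (seconds).

        Returns True when the number of matching failures inside the
        sliding window has reached the threshold, i.e. when the host
        should take itself out of service.
        """
        if any(p.search(message) for p in CORRUPTION_PATTERNS):
            self._hits.append(now)
        # Forget matches that have aged out of the window.
        while self._hits and now - self._hits[0] > self.window_seconds:
            self._hits.popleft()
        return len(self._hits) >= self.threshold
```

With a monitor like this, the caller would feed in each task failure message on the host and
stop accepting tasks once observe() returns True. The sliding window keeps an isolated failure
from long ago from counting toward the threshold.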

I did not see this issue occur on the previous version of Hadoop we were using on Amazon EMR
(0.20), where we used LZO compression for intermediate map outputs.
