hadoop-mapreduce-dev mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (MAPREDUCE-4852) Reducer should not signal fetch failures for disk errors on the reducer's side
Date Wed, 23 Apr 2014 19:04:18 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe resolved MAPREDUCE-4852.

    Resolution: Duplicate

This was fixed by MAPREDUCE-5251.

> Reducer should not signal fetch failures for disk errors on the reducer's side
> ------------------------------------------------------------------------------
>                 Key: MAPREDUCE-4852
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4852
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Jason Lowe
> Ran across a case where a reducer ran on a node where the disks were full, leading to an exception like this during the shuffle fetch:
> {noformat}
> 2012-12-05 09:07:28,749 INFO [fetcher#25] org.apache.hadoop.mapreduce.task.reduce.MergeManager: attempt_1352354913026_138167_m_000654_0: Shuffling to disk since 235056188 is greater than maxSingleShuffleLimit (155104064)
> 2012-12-05 09:07:28,755 INFO [fetcher#25] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#25 failed to read map headerattempt_1352354913026_138167_m_000654_0 decomp: 235056188,
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/attempt_1352354913026_138167_r_000189_0/map_654.out
> 	at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
> 	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
> 	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
> 	at org.apache.hadoop.mapred.YarnOutputFiles.getInputFileForWrite(YarnOutputFiles.java:213)
> 	at org.apache.hadoop.mapreduce.task.reduce.MapOutput.<init>(MapOutput.java:81)
> 	at org.apache.hadoop.mapreduce.task.reduce.MergeManager.reserve(MergeManager.java:245)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:348)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:283)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:155)
> 2012-12-05 09:07:28,755 WARN [fetcher#25] org.apache.hadoop.mapreduce.task.reduce.Fetcher: copyMapOutput failed for tasks [attempt_1352354913026_138167_m_000654_0]
> 2012-12-05 09:07:28,756 INFO [fetcher#25] org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: Reporting fetch failure for attempt_1352354913026_138167_m_000654_0 to jobtracker.
> {noformat}
> Even though the error was local to the reducer, it reported the error to the AM as a fetch failure rather than failing the reducer itself. It then ran into the same error for many other maps, causing them to be relaunched because of the reported fetch failures. In this case it would have been better to fail the reducer and try another node rather than blame the mapper for what is an error on the reducer's side.
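
As a rough illustration of the behavior suggested above (not the actual change made for MAPREDUCE-5251), a fetcher could separate reducer-local disk errors from remote fetch errors before deciding whom to blame. The sketch below is self-contained Java and uses no Hadoop classes; `FetchErrorClassifier`, `LocalDiskException`, `Action`, and `classify` are hypothetical names introduced only for this example.

```java
import java.io.IOException;

// Hypothetical sketch: decide whether a shuffle error should be blamed on the
// map output (fetch failure) or on the reducer's own node (fail the reducer).
final class FetchErrorClassifier {

    // Stand-in for a reducer-local disk error such as "could not find any
    // valid local directory". This is an assumption, not a Hadoop class.
    static class LocalDiskException extends IOException {
        LocalDiskException(String message) {
            super(message);
        }
    }

    enum Action { REPORT_FETCH_FAILURE, FAIL_REDUCER }

    // Walk the cause chain: any local disk error means the reducer's node is
    // at fault, so the map attempt should not be penalized.
    static Action classify(IOException error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof LocalDiskException) {
                return Action.FAIL_REDUCER;
            }
        }
        return Action.REPORT_FETCH_FAILURE;
    }

    public static void main(String[] args) {
        IOException local = new LocalDiskException("no valid local directory");
        IOException remote = new IOException("connection reset by peer");
        IOException wrapped = new IOException("copy failed", local);

        System.out.println(classify(local));   // FAIL_REDUCER
        System.out.println(classify(remote));  // REPORT_FETCH_FAILURE
        System.out.println(classify(wrapped)); // FAIL_REDUCER
    }
}
```

With a split like this, a reducer on a node with full disks would fail and be rescheduled elsewhere, and the map attempts it tried to fetch would not be relaunched.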
