hadoop-mapreduce-dev mailing list archives

From "Emilio Coppa (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-5958) Wrong reduce task progress if map output is compressed
Date Sat, 05 Jul 2014 16:21:33 GMT
Emilio Coppa created MAPREDUCE-5958:
---------------------------------------

             Summary: Wrong reduce task progress if map output is compressed
                 Key: MAPREDUCE-5958
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5958
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 2.4.1, 2.2.1, 2.3.0, 2.2.0, 2.4.0
            Reporter: Emilio Coppa
            Priority: Minor


If the map output is compressed (_mapreduce.map.output.compress_ set to _true_) then the reduce
task progress may be highly underestimated.

In the reduce phase (but also in the merge phase), the progress of a reduce task is computed
as the ratio between the number of processed bytes and the number of total bytes. But:

- the number of total bytes is computed by summing up the uncompressed segment sizes (_Merger.Segment.getRawDataLength()_)

- the number of processed bytes is computed from the position of the current _IFile.Reader_
(via _IFile.Reader.getPosition()_), but this may refer to the position in the underlying
on-disk file (which may be compressed)

Thus, if the map outputs are compressed, the progress may be underestimated (e.g., with only
one on-disk map output file whose compressed size is 25% of its original size, the reduce
task progress during the reduce phase will range between 0 and 0.25 and then artificially
jump to 1.0).
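The mismatch above can be sketched numerically. This is an illustrative model, not the actual Hadoop code: the names _RAW_LENGTH_, _COMPRESSED_LENGTH_, and _buggyProgress_ are hypothetical, standing in for _Merger.Segment.getRawDataLength()_ and the position-based computation described above.

```java
// Illustrative sketch of the buggy progress computation (names are
// hypothetical, not actual Hadoop fields).
public class ProgressSketch {
    // One on-disk map output: 100 MB uncompressed, compressed to 25 MB.
    static final long RAW_LENGTH = 100L * 1024 * 1024;
    static final long COMPRESSED_LENGTH = 25L * 1024 * 1024;

    // Buggy: processed bytes come from the position in the (compressed)
    // on-disk file, while the total is the uncompressed segment size.
    static float buggyProgress(long compressedFilePosition) {
        return (float) compressedFilePosition / RAW_LENGTH;
    }

    public static void main(String[] args) {
        // Even after the whole compressed file has been read, the reported
        // progress is only 0.25; it then jumps to 1.0 at task completion.
        System.out.println(buggyProgress(COMPRESSED_LENGTH)); // prints 0.25
    }
}
```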

A patch is attached: the number of processed bytes is now computed from _IFile.Reader.bytesRead_
(if the reader is in memory, _getPosition()_ already returns exactly this field).
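The idea behind the fix can be sketched as follows: count the uncompressed bytes actually handed out by the reader (a _bytesRead_-style counter) instead of the file position. Again, this is a simplified model under assumed names, not the patched Hadoop code itself.

```java
// Illustrative sketch of the fix: progress is based on a counter of
// uncompressed bytes returned to the caller, so it no longer depends on
// whether the underlying on-disk segment is compressed.
public class FixedProgressSketch {
    static final long RAW_LENGTH = 100L * 1024 * 1024; // uncompressed size

    long bytesRead = 0; // uncompressed bytes handed to the caller so far

    // Simulate reading n uncompressed bytes out of the segment.
    void read(long n) {
        bytesRead += n;
    }

    float progress() {
        return (float) bytesRead / RAW_LENGTH;
    }

    public static void main(String[] args) {
        FixedProgressSketch reader = new FixedProgressSketch();
        reader.read(RAW_LENGTH / 2);
        // Halfway through the raw data now correctly reports 0.5,
        // regardless of the compressed file size.
        System.out.println(reader.progress()); // prints 0.5
    }
}
```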




--
This message was sent by Atlassian JIRA
(v6.2#6252)
