hadoop-mapreduce-dev mailing list archives

From Ling Kun <lkun.e...@gmail.com>
Subject Why In-memory Mapoutput is necessary in ReduceCopier
Date Mon, 11 Mar 2013 10:27:26 GMT
Dear all,

     I am studying the map output copier implementation. This part of
the code fetches map outputs and merges them into a file that can be fed
to the reduce function. I have the following questions.

1. All map output data held in local files is merged by LocalFSMerger,
and the in-memory map outputs are merged by InMemFSMergeThread.
InMemFSMergeThread also has a writer object which writes the merged
result to outputPath (ReduceTask.java, line 2843). So it seems that after
merging, both the in-memory and the local-file map output data end up in
the local file system. Why not just use local files for all map output
data?

2. After a fragment of a map output file is fetched over HTTP, some of the
map output data is kept in memory, while the rest is written directly to
the reducer's local disk. Which map outputs are kept in memory is decided
in MapOutputCopier.getMapOutput(), which calls
ramManager.canFitInMemory(). Why not store all the data on disk?

3. According to the comment, Hadoop keeps a file in memory if it meets two
conditions: (a) the size of the (decompressed) file is less than 25% of the
total in-memory file system, and (b) there is space available in the
in-memory file system. Why? Is it for performance reasons?
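
To make the rule in questions 2 and 3 concrete, here is a minimal sketch of
the decision as I understand it from the comment. It is not the Hadoop
source: the class ShuffleRamSketch, the methods tryReserve() and place(),
and the constant name are hypothetical, and only the 25% limit and the
"space available" condition come from the comment. Falling back to disk
when the buffer is full is a simplification; the real code may behave
differently, for example by waiting for space to be freed.

```java
/** Hypothetical sketch of the in-memory admission rule; not the Hadoop source. */
public class ShuffleRamSketch {
    // From the comment: one map output may use less than 25% of the buffer.
    static final float MAX_SINGLE_SEGMENT_FRACTION = 0.25f;

    private final long maxSize;         // total bytes of the in-memory fs
    private final long maxSingleLimit;  // per-map-output size limit
    private long used;                  // bytes currently reserved

    public ShuffleRamSketch(long maxSize) {
        this.maxSize = maxSize;
        this.maxSingleLimit = (long) (maxSize * MAX_SINGLE_SEGMENT_FRACTION);
    }

    /** Condition (a): the decompressed size is below the per-output limit. */
    public boolean canFitInMemory(long decompressedSize) {
        return decompressedSize < maxSingleLimit;
    }

    /** Condition (b): reserve space if the buffer still has room. */
    public boolean tryReserve(long size) {
        if (used + size > maxSize) {
            return false;
        }
        used += size;
        return true;
    }

    /** Returns "memory" if both conditions hold, otherwise "disk". */
    public String place(long decompressedSize) {
        if (canFitInMemory(decompressedSize) && tryReserve(decompressedSize)) {
            return "memory";
        }
        return "disk";
    }
}
```

With a 1000-byte buffer, place(300) goes to disk because 300 is not below
the 250-byte per-output limit, while place(200) stays in memory until the
buffer has no room left.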


Ling Kun

