hadoop-mapreduce-dev mailing list archives

From Anty <anty....@gmail.com>
Subject Bugs in hadoop-0.23.0
Date Wed, 18 Jan 2012 06:05:25 GMT
  After reading and testing the reduce-phase code in hadoop-0.23.0, I
think I found some bugs; maybe this is helpful to you.
1) In the constructor of OnDiskMerger in MergeManager.java, at line 472:

     public OnDiskMerger(MergeManager<K, V> manager) {
       super(manager, Integer.MAX_VALUE, exceptionReporter);
       setName("OnDiskMerger - Thread to merge on-disk map-outputs");

The second constructor parameter, mergeFactor, can't be Integer.MAX_VALUE;
it should be io.sort.factor. With Integer.MAX_VALUE, the OnDiskMerger
merges all the files fed to it in a single pass and ignores the
io.sort.factor parameter.
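To illustrate why the merge factor matters, here is a minimal, self-contained sketch (not Hadoop code; the class and method names are hypothetical) of a multi-pass merge that never merges more than mergeFactor sorted runs at a time, which is the behavior io.sort.factor is supposed to enforce:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class FactorLimitedMerge {

  // Merge up to mergeFactor sorted runs per pass until one run remains.
  // With mergeFactor == Integer.MAX_VALUE, every run is merged in one
  // pass -- which is what the OnDiskMerger constructor above ends up doing.
  public static List<Integer> merge(List<List<Integer>> runs, int mergeFactor) {
    List<List<Integer>> current = new ArrayList<>(runs);
    while (current.size() > 1) {
      List<List<Integer>> next = new ArrayList<>();
      for (int i = 0; i < current.size(); i += mergeFactor) {
        int end = Math.min(i + mergeFactor, current.size());
        next.add(mergeOnce(current.subList(i, end)));
      }
      current = next;
    }
    return current.isEmpty() ? new ArrayList<>() : current.get(0);
  }

  // Standard k-way merge of a few sorted runs via a min-heap of
  // (run index, position) pairs.
  private static List<Integer> mergeOnce(List<List<Integer>> runs) {
    PriorityQueue<int[]> heap = new PriorityQueue<>(
        (a, b) -> Integer.compare(runs.get(a[0]).get(a[1]),
                                  runs.get(b[0]).get(b[1])));
    for (int r = 0; r < runs.size(); r++) {
      if (!runs.get(r).isEmpty()) heap.add(new int[] {r, 0});
    }
    List<Integer> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      int[] top = heap.poll();
      out.add(runs.get(top[0]).get(top[1]));
      if (top[1] + 1 < runs.get(top[0]).size()) {
        heap.add(new int[] {top[0], top[1] + 1});
      }
    }
    return out;
  }
}
```

A small factor trades extra passes for bounded fan-in (and bounded open files); an unbounded factor opens every input at once.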

2) Still in MergeManager.java, at line 90, the data structure

     Set<Path> onDiskMapOutputs = new TreeSet<Path>();

is incorrect: the on-disk files should be sorted by file length, not by
the URI of the Path. As it stands, the files fed to OnDiskMerger are not
sorted by length, which hurts the overall merge performance.
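A minimal sketch of the ordering that is needed (outside Hadoop, so a plain class of path and length stands in for Path; the names are hypothetical): order by file length first, and fall back to the path only to break ties, so that TreeSet still keeps distinct files distinct:

```java
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;

public class OnDiskOutputOrder {

  // Stand-in for an on-disk map output: a path string plus its length.
  public static final class DiskOutput {
    final String path;
    final long length;
    public DiskOutput(String path, long length) {
      this.path = path;
      this.length = length;
    }
  }

  // Order by length, breaking ties by path so that equal-length files
  // are not collapsed into a single TreeSet entry.
  public static Set<DiskOutput> newLengthOrderedSet() {
    return new TreeSet<>(
        Comparator.comparingLong((DiskOutput d) -> d.length)
                  .thenComparing(d -> d.path));
  }
}
```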

3) Still in MergeManager.java, starting from line 745:

      if (0 != onDiskBytes) {
        final int numInMemSegments = memDiskSegments.size();
        diskSegments.addAll(0, memDiskSegments);
        // Pass mergePhase only if there is going to be an intermediate
        // merge. See comment where mergePhaseFinished is being set
        Progress thisPhase = (mergePhaseFinished) ? null : mergePhase;
        RawKeyValueIterator diskMerge = Merger.merge(
            job, fs, keyClass, valueClass, diskSegments,
            ioSortFactor, numInMemSegments, tmpDir, comparator,
            reporter, false, spilledRecordsCounter, null, thisPhase);
        if (0 == finalSegments.size()) {
          return diskMerge;
        }
        finalSegments.add(new Segment<K,V>(
            new RawKVIteratorReader(diskMerge, onDiskBytes), true));
      }

The Merger.merge call above merges the on-disk files down to
io.sort.factor, and may run intermediate merge passes. The problem is
that the codec parameter is not passed in, so the intermediate merge
passes write their output to disk uncompressed, leading to a huge
performance degradation.
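A sketch of the fix, assuming this version of Merger.java has a merge overload that accepts a CompressionCodec right after the value class (the exact parameter position is an assumption; check the actual overloads before applying):

```java
// Hypothetical corrected call: pass the reduce-side codec so that
// intermediate merge output is compressed on disk. The parameter
// order shown here is an assumption about the Merger.merge overload.
RawKeyValueIterator diskMerge = Merger.merge(
    job, fs, keyClass, valueClass, codec, diskSegments,
    ioSortFactor, numInMemSegments, tmpDir, comparator,
    reporter, false, spilledRecordsCounter, null, thisPhase);
```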

4) The implementation of Reader in IFile.java is (maybe) not very
efficient. In previous Hadoop versions it read a whole buffer from disk
at a time, so there was no need to hit the disk for every key/value
pair. I don't know why this code was rewritten.
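To illustrate the point, here is a self-contained sketch (plain java.io, not the actual IFile format) of reading length-prefixed records through a single buffered stream, so the underlying source is hit in large chunks rather than once per key/value:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BufferedRecordReader {

  // Write records as (length, bytes) pairs -- a simplified stand-in
  // for a length-prefixed on-disk record format.
  public static byte[] writeRecords(List<String> records) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    for (String r : records) {
      byte[] b = r.getBytes(StandardCharsets.UTF_8);
      out.writeInt(b.length);
      out.write(b);
    }
    out.flush();
    return bytes.toByteArray();
  }

  // Read every record through one BufferedInputStream: the raw source
  // is read in 64 KB chunks, not once per record.
  public static List<String> readRecords(byte[] data) throws IOException {
    List<String> out = new ArrayList<>();
    DataInputStream in = new DataInputStream(
        new BufferedInputStream(new ByteArrayInputStream(data), 64 * 1024));
    while (in.available() > 0) {
      byte[] b = new byte[in.readInt()];
      in.readFully(b);
      out.add(new String(b, StandardCharsets.UTF_8));
    }
    return out;
  }
}
```

With a real file the same structure applies: the buffer absorbs the small per-record reads, and only buffer refills reach the disk.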

Hopefully I didn't miss anything.

Best Regards
Anty Rao
