hadoop-mapreduce-dev mailing list archives

From "wangxiaowei (JIRA)" <j...@apache.org>
Subject [jira] Created: (MAPREDUCE-1939) split reduce compute phase into two threads, one for reading and another for computing
Date Wed, 14 Jul 2010 03:03:01 GMT
split reduce compute phase into two threads, one for reading and another for computing

                 Key: MAPREDUCE-1939
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1939
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
    Affects Versions: 0.20.2
            Reporter: wangxiaowei
             Fix For: 0.20.2

It is known that a reduce task is made up of three phases: shuffle, sort, and reduce. During
the reduce phase, the reduce function first reads a record from disk or memory, processes it,
and finally writes the result to HDFS. To turn this serial process into a parallel one, I split
the reduce phase into two threads, called the producer and the consumer. The producer reads
records from disk, and the consumer processes the records the producer has read. I use two
buffers: while the producer is filling one buffer, the consumer reads from the other. In theory
the two stages overlap, so the total reduce time should shrink.
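The two-buffer handoff described above can be sketched as follows. This is a hypothetical, self-contained model of the scheme, not the actual patch: the class name, buffer sizes, and the use of string records in place of raw key/value bytes are all my assumptions, and two blocking queues stand in for the producer/consumer synchronization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: the producer fills one buffer while the consumer
// drains the other, so reading and computing can overlap.
public class DoubleBufferReduce {
    private static final List<String> POISON = new ArrayList<>(); // end-of-input marker

    public static long run(List<String> records, int bufferSize) throws InterruptedException {
        // "empty" holds buffers the producer may fill; "full" holds buffers ready to consume.
        BlockingQueue<List<String>> empty = new ArrayBlockingQueue<>(2);
        BlockingQueue<List<String>> full  = new ArrayBlockingQueue<>(2);
        empty.put(new ArrayList<>());
        empty.put(new ArrayList<>());

        Thread producer = new Thread(() -> {
            try {
                int i = 0;
                while (i < records.size()) {
                    List<String> buf = empty.take();          // wait for a free buffer
                    buf.clear();
                    for (int n = 0; n < bufferSize && i < records.size(); n++, i++) {
                        buf.add(records.get(i));              // stands in for reading bytes from disk
                    }
                    full.put(buf);                            // hand the filled buffer over
                }
                full.put(POISON);                             // signal end of input
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        long processed = 0;                                    // consumer runs on this thread
        while (true) {
            List<String> buf = full.take();
            if (buf == POISON) break;
            for (String rec : buf) {
                processed += rec.length();                     // stands in for deserialize + reduce
            }
            empty.put(buf);                                    // return the drained buffer
        }
        producer.join();
        return processed;
    }
}
```

With two buffers in flight, the producer blocks only when the consumer has not yet returned a drained buffer, which is exactly the stall described later in this message.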

I wonder why Hadoop does not implement this already. Are there potential problems with
this idea?

I have already implemented a prototype. The producer just reads bytes from the disk and leaves
the deserialization into real key and value objects to the consumer. The results are not good:
only a 13% improvement in running time. I think this has to do with the buffer size and with
how the time is split between the two threads. Perhaps the consumer thread takes too long, so
the producer has to wait until the next buffer becomes available.
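A rough pipeline model may explain why the gain is small. This is my own back-of-envelope assumption, not a measurement from the prototype: if reading a buffer takes time p and consuming it takes time c, a two-stage pipeline needs about max(p, c) per buffer instead of p + c, so the speedup is bounded by (p + c) / max(p, c) and collapses toward 1 when one stage dominates.

```java
// Hypothetical two-stage pipeline bound; p and c are per-buffer stage times.
public class PipelineBound {
    static double speedup(double p, double c) {
        // Serial cost is p + c per buffer; pipelined cost is max(p, c).
        return (p + c) / Math.max(p, c);
    }

    public static void main(String[] args) {
        System.out.println(speedup(1.0, 1.0)); // balanced stages: best case, 2x
        System.out.println(speedup(1.0, 6.0)); // compute-bound: only ~1.17x
    }
}
```

Under this model, a 13% improvement would be consistent with the consumer doing far more work per buffer than the producer, which matches the observation that the producer ends up waiting.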

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
