hadoop-mapreduce-dev mailing list archives

From 易剑 <myhad...@gmail.com>
Subject The idea to enhance MapReduce to resolve the skew problem
Date Thu, 04 Feb 2010 08:10:43 GMT
Currently, only map tasks are balanced. Reduce tasks can be skewed, and the
time each task takes varies widely, which keeps the scheduler from making
smart decisions. I have an idea to improve this.

We can break the map output into N*M splits, where N is the number of nodes
and M >= 1, and then regroup them into new splits by combining the smaller
splits and resplitting the bigger ones, until the size of every split is
balanced around a specified value.
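
A minimal sketch of the regrouping step, not Hadoop code. It assumes that
split sizes are plain byte counts, that any split can be cut at an arbitrary
point (which only holds when it contains more than one key), and that
combining is done with a first-fit-decreasing heuristic. The name `rebalance`
and the `target` parameter are hypothetical.

```python
from typing import List


def rebalance(split_sizes: List[int], target: int) -> List[int]:
    """Regroup splits so that no new split is larger than `target`.

    Simplification (assumption): sizes are treated as freely divisible.
    """
    if target <= 0:
        raise ValueError("target must be positive")

    # Resplit: cut every oversized split into target-sized pieces.
    pieces: List[int] = []
    for size in split_sizes:
        while size > target:
            pieces.append(target)
            size -= target
        if size > 0:
            pieces.append(size)

    # Combine: pack the pieces, largest first, into the first new split
    # that still has room (first-fit decreasing).
    pieces.sort(reverse=True)
    new_splits: List[int] = []
    for piece in pieces:
        for i, used in enumerate(new_splits):
            if used + piece <= target:
                new_splits[i] = used + piece
                break
        else:
            new_splits.append(piece)
    return new_splits


if __name__ == "__main__":
    # One large split and four small ones, regrouped around 100.
    print(rebalance([250, 10, 20, 30, 40], target=100))
```

The total amount of data is unchanged; only the grouping changes, so the last
new split may be smaller than the others.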

There are three cases:
1. Too many values for a key
2. Too many keys hash to a partition
3. Every partition is balanced in size

If there are too many values for a key, an additional MapReduce pass is
necessary. If too many keys hash to a partition, resplitting is necessary.
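
The three cases could be told apart roughly as below. This is only a sketch
under assumptions: the per-key sizes of a partition are known (for example
from counters collected on the map side), and a single `target` size is the
threshold for both a key and a partition. The function name, the returned
labels and the threshold rule are all hypothetical.

```python
from typing import Dict


def classify_partition(key_sizes: Dict[str, int], target: int) -> str:
    """Decide which of the three cases a partition falls into.

    key_sizes maps each key in the partition to the total size of its values.
    """
    # Case 1: a single key is already larger than the target, so no
    # resplitting by key can help; an extra MapReduce pass is needed.
    if any(size > target for size in key_sizes.values()):
        return "extra_pass"

    # Case 2: no key is too large, but together they exceed the target,
    # so the partition can be resplit along key boundaries.
    if sum(key_sizes.values()) > target:
        return "resplit"

    # Case 3: the partition is already within the target size.
    return "balanced"


if __name__ == "__main__":
    print(classify_partition({"a": 500, "b": 10}, target=100))
    print(classify_partition({"a": 60, "b": 70}, target=100))
    print(classify_partition({"a": 40, "b": 50}, target=100))
```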

If every split is balanced, we can treat a task (map or reduce) as one
scheduler timeslice, and the scheduler can then be as smart as an OS scheduler.
