hadoop-mapreduce-dev mailing list archives

From 蒋达晟 <dsjiang...@gmail.com>
Subject [DISCUSS] sort improvement
Date Fri, 21 Mar 2014 03:40:52 GMT
I'd like to discuss some improvements to the sort stage in current
Hadoop. There is no formal document available right now, so I will just
outline the idea and hope someone is interested in it. If so, I can provide
more detailed information.



To start with, I have conducted a series of experiments on version 2.2, and
I will use the example Terasort job to demonstrate my thoughts.



Machine: 19 slave machines, 12-core E5-2420 (1.9 GHz), 7 disks

Configuration: JVM -Xmx512m, io.sort.mb=300, tuned so that there is no extra
spill on either the map side or the reduce side

Datasize: 1 TB (1e12 bytes), 4000 maps / 4000 reduces



Baseline: version 2.2

Result: around 109000s of CPU time

(The Terasort job here is optimized because I just wanted to measure the cost
of the framework; anyone redoing the test with the Terasort job from the
examples would see a surprisingly higher CPU time because of the counter
system.)

Sort takes about 45% of the total CPU time.



Improvement 0+1

Result: around 78000s of CPU time



Improvement 0+1+2

Result: around 63000s of CPU time



Improvement 0 is to use SSE to accelerate the IFile checksum. Currently the
Java implementation costs too much CPU. I believe it deserves a separate
discussion, so I won't expand on it further here. Let me know if it is worth
doing.
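A minimal sketch of the shape of this change (illustration only, not the
actual patch; the SseCrc32 class below is a hypothetical stand-in for a JNI
wrapper around the SSE4.2 crc32 instruction, with java.util.zip.CRC32 as the
fallback):

import java.util.zip.CRC32;
import java.util.zip.Checksum;

public final class IFileChecksums {

  /** Hypothetical JNI-backed CRC that would use the SSE4.2 crc32 instruction. */
  static class SseCrc32 implements Checksum {
    static boolean isAvailable() { return false; }  // would probe for the native library
    private final CRC32 fallback = new CRC32();     // stub body so the sketch compiles
    public void update(int b)                      { fallback.update(b); }
    public void update(byte[] b, int off, int len) { fallback.update(b, off, len); }
    public long getValue()                         { return fallback.getValue(); }
    public void reset()                            { fallback.reset(); }
  }

  /** Pick the cheapest checksum implementation available for IFile data. */
  public static Checksum newChecksum() {
    return SseCrc32.isAvailable() ? new SseCrc32() : new CRC32();
  }
}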



Improvement 1 uses a reduce-side sort: the intermediate result contains
unsorted data, and the reducer receives the data and sorts it before the
reduce function runs.

The reason it saves CPU is that quicksort is more efficient than heap-based
merge sort. It reduces the comparison count, which in some circumstances is a
direct reflection of the CPU cost.

For a large job with thousands of reduces, each map sends only a little data
to each reduce, so the majority of the sort work happens on the reduce side.
In our case, each map sorts roughly 62500 bytes per reduce partition with
quicksort, but each reduce has to merge 4000 segments.
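To make the comparison-count argument concrete, here is a rough
back-of-the-envelope estimate (my own constants, not a measurement): each
250MB slice is about 2.5 million 100-byte records; the current path pays for
a map-side quicksort plus a 4000-way heap merge on the reduce side, while the
proposed path pays only for one reduce-side quicksort, since the map no
longer sorts by key.

public class SortCostEstimate {
  public static void main(String[] args) {
    long n = 250L * 1000 * 1000 / 100;   // ~2.5 million 100-byte records per 250 MB slice
    int k = 4000;                        // segments each reduce currently has to merge

    double quick = 1.39 * n * log2(n);   // average comparisons for one quicksort of n records
    double merge = n * log2(k);          // comparisons for a k-way heap merge of n records

    System.out.printf("map quicksort + reduce merge: ~%.0fM comparisons%n", (quick + merge) / 1e6);
    System.out.printf("reduce-side quicksort only:   ~%.0fM comparisons%n", quick / 1e6);
  }

  static double log2(double x) { return Math.log(x) / Math.log(2); }
}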

There are several drawbacks to using a reduce-side sort.

First of all, it requires a rewrite of the whole data path. Although writing
the code is not much work, a few strategies in the current implementation
will have to be reconsidered in the new one.

Secondly, the combiner would be tricky to implement. One choice is to use an
in-memory hash table, which needs precise tracking of the memory usage of
Java objects. Another choice is to keep the current data path whenever a
combiner is required, but maintaining two separate data paths seems to cause
a lot of extra work in the future.

I have tried the in-memory hash table approach. When selectivity is high (the
number of intermediate keys is much smaller than the number of input keys),
it is feasible and gives a certain performance gain, but when selectivity is
low I don't see any performance advantage. Anyway, I believe there is a way
to implement a combiner at lower cost; I just haven't thought it through yet.
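For illustration, a minimal sketch of the hash-table combiner idea (not my
actual experimental code; simplified to String keys and summed long values,
and the per-entry size estimate is a crude stand-in for the precise memory
tracking mentioned above):

import java.util.HashMap;
import java.util.Map;

public class HashCombinerSketch {
  private final Map<String, Long> counts = new HashMap<>();
  private long estimatedBytes = 0;
  private final long budgetBytes = 64L << 20;     // assumed 64 MB in-memory budget

  public void collect(String key, long value) {
    Long old = counts.put(key, value + counts.getOrDefault(key, 0L));
    if (old == null) {
      estimatedBytes += 64 + 2L * key.length();   // rough per-entry estimate, not exact
    }
    if (estimatedBytes >= budgetBytes) {
      flush();
    }
  }

  private void flush() {
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      emit(e.getKey(), e.getValue());             // write combined pairs to the (unsorted) output
    }
    counts.clear();
    estimatedBytes = 0;
  }

  private void emit(String key, long value) {
    System.out.println(key + "\t" + value);       // placeholder for the real output path
  }
}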

The last drawback is memory usage. We still need a large buffer, as
io.sort.mb indicates, on the map side to avoid small spill files, and we also
need a buffer on the reduce side to sort the data into a large segment. I
don't think this is a big problem; I mention it just to complete the thought.



Improvement 2 is about the sorting algorithm itself. I know some issues have
been raised to improve it, but there is still a lot of room for further
improvement.

First of all, the current MapOutputBuffer implementation uses a single buffer
to store both data and index. It is an adaptive strategy for both long and
short records, but it's not a good choice for performance. An implementation
that uses a plain int[] for the 16-byte index entries would cut the CPU cost
by half in my experiment.
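A minimal sketch of the separate-index idea (assuming a metadata layout of
{partition, keyStart, keyLen, valLen}, which is only an approximation of the
real MapOutputBuffer layout; the sort is a plain insertion sort to keep the
sketch short):

public class IndexArraySortSketch {
  static final int NMETA = 4;  // ints per record, 16 bytes: partition, keyStart, keyLen, valLen

  /** Sort index entries by (partition, key bytes); record bytes in data[] never move. */
  static void sortIndex(int[] meta, int recordCount, byte[] data) {
    for (int i = 1; i < recordCount; i++) {
      for (int j = i; j > 0 && compare(meta, j - 1, j, data) > 0; j--) {
        swapEntry(meta, j - 1, j);   // swap 4 ints in the index array only
      }
    }
  }

  static int compare(int[] meta, int a, int b, byte[] data) {
    int pa = meta[a * NMETA], pb = meta[b * NMETA];
    if (pa != pb) return Integer.compare(pa, pb);   // partition is the primary sort key
    return compareBytes(data, meta[a * NMETA + 1], meta[a * NMETA + 2],
                        data, meta[b * NMETA + 1], meta[b * NMETA + 2]);
  }

  static int compareBytes(byte[] x, int xs, int xl, byte[] y, int ys, int yl) {
    int n = Math.min(xl, yl);
    for (int i = 0; i < n; i++) {
      int c = (x[xs + i] & 0xff) - (y[ys + i] & 0xff);  // unsigned, lexicographic byte order
      if (c != 0) return c;
    }
    return xl - yl;
  }

  static void swapEntry(int[] meta, int a, int b) {
    for (int k = 0; k < NMETA; k++) {
      int t = meta[a * NMETA + k];
      meta[a * NMETA + k] = meta[b * NMETA + k];
      meta[b * NMETA + k] = t;
    }
  }
}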

But even if the QuickSort implementation is optimized, it will still be much
slower than RadixSort. On 250MB of Terasort data, RadixSort can be 6.5x~10x
faster in my test.

One thing that needs clarification is how to adapt RadixSort to current
Hadoop. The plain answer is that we can't replace QuickSort with RadixSort in
the general framework, but we can for the majority of usage.

The assumption here is that most of the key types we use in Hadoop are
IntWritable, Text, or other subclasses of BinaryComparable. As long as we can
extract four bytes from the key that preserve the ordering of the original
key, we can sort by those four bytes with RadixSort and then fall back to
QuickSort within runs whose four bytes are equal.
MAPREDUCE-1639 <https://issues.apache.org/jira/browse/MAPREDUCE-1639> and
MAPREDUCE-3235 <https://issues.apache.org/jira/browse/MAPREDUCE-3235> have
related discussion.
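A minimal sketch of this prefix-radix approach (my illustration, not the
patched code; it radix-sorts record indices by an order-preserving 4-byte
prefix and then uses a comparison sort only inside runs whose prefixes are
equal):

import java.util.Arrays;

public class PrefixRadixSortSketch {

  /** Order-preserving 4-byte prefix of a binary key (first 4 bytes, zero-padded). */
  static int prefixOf(byte[] key) {
    int p = 0;
    for (int i = 0; i < 4; i++) {
      p = (p << 8) | (i < key.length ? (key[i] & 0xff) : 0);
    }
    return p;
  }

  /** Radix sort on the 4-byte prefix, then comparison sort inside equal-prefix runs. */
  static void sort(byte[][] keys) {
    int n = keys.length;
    int[] prefix = new int[n];
    Integer[] src = new Integer[n], dst = new Integer[n];
    for (int i = 0; i < n; i++) { prefix[i] = prefixOf(keys[i]); src[i] = i; }

    // LSD radix sort of record indices over the 4 prefix bytes (stable counting sort per byte).
    for (int shift = 0; shift < 32; shift += 8) {
      int[] count = new int[257];
      for (Integer i : src) count[((prefix[i] >>> shift) & 0xff) + 1]++;
      for (int b = 0; b < 256; b++) count[b + 1] += count[b];
      for (Integer i : src) dst[count[(prefix[i] >>> shift) & 0xff]++] = i;
      Integer[] t = src; src = dst; dst = t;
    }

    // Finish: within each run of equal prefixes, fall back to a full-key comparison sort.
    for (int lo = 0; lo < n; ) {
      int hi = lo + 1;
      while (hi < n && prefix[src[hi]] == prefix[src[lo]]) hi++;
      if (hi - lo > 1) {
        Arrays.sort(src, lo, hi, (a, b) -> compareKeys(keys[a], keys[b]));
      }
      lo = hi;
    }

    // Apply the permutation (sketch: materialize a sorted copy).
    byte[][] sorted = new byte[n][];
    for (int i = 0; i < n; i++) sorted[i] = keys[src[i]];
    System.arraycopy(sorted, 0, keys, 0, n);
  }

  static int compareKeys(byte[] x, byte[] y) {
    int m = Math.min(x.length, y.length);
    for (int i = 0; i < m; i++) {
      int c = (x[i] & 0xff) - (y[i] & 0xff);
      if (c != 0) return c;
    }
    return x.length - y.length;
  }
}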

Another important thing to point out is that if we don't combine improvement
2 with improvement 1, it will be less effective, since the majority of the
sort happens on the reduce side using the less efficient merge sort.



The combination of improvements 1 & 2 makes sort an insignificant part of the
job. In the Terasort job, of the 63000s of CPU time in our improved
implementation, only 1500s is spent in the sort. A few other jobs from
HiBench also show less than 5% spent on sort.

It's a big saving, but it requires a major rewrite. Please let me know if
anyone is interested in the idea.

Thank you!
-- 

*Dasheng Jiang*

*Peking University, Beijing*

*Email: dsjiang753@gmail.com*

*Phone: 18810775811*
