hadoop-mapreduce-dev mailing list archives

From Chris Douglas <cdoug...@apache.org>
Subject Re: [DISCUSS] sort improvement
Date Sat, 22 Mar 2014 01:27:48 GMT
The sort implementation is pluggable (see MAPREDUCE-2454,
MAPREDUCE-4049), so please feel free to fork and improve it. Selecting
a sort implementation based on job configuration (e.g.,
BinaryComparable keys) would allow for much more efficient and
specialized implementations. -C
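For reference, the hooks from those JIRAs are selected per job through configuration; a sketch assuming the Hadoop 2.x property names, with the stock defaults shown:

```xml
<!-- map side: MAPREDUCE-2454, pluggable MapOutputCollector -->
<property>
  <name>mapreduce.job.map.output.collector.class</name>
  <value>org.apache.hadoop.mapred.MapTask$MapOutputBuffer</value>
</property>
<!-- reduce side: MAPREDUCE-4049, pluggable shuffle/merge consumer -->
<property>
  <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
  <value>org.apache.hadoop.mapreduce.task.reduce.Shuffle</value>
</property>
```

A replacement sort would be wired in by substituting a custom class for the map-side default.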

On Thu, Mar 20, 2014 at 8:40 PM, 蒋达晟 <dsjiang753@gmail.com> wrote:
> I'd like to discuss some improvements to the sort stage in current
> Hadoop. There are no formal design documents right now, so I just hope
> someone will be interested. If so, I can provide more detailed
> information.
>
>
>
> To start with, I conducted a series of experiments on version 2.2, and
> I will use the example Terasort job to demonstrate my ideas.
>
>
>
> Machine: 19 slave machines, 12-core E5-2420 (1.9GHz), 7 disks
>
> Configuration: jvm=-Xmx512m, io.sort.mb=300; make sure there is no extra
> spill on either the map or reduce side
>
> Data size: 1TB (1e12 bytes), 4000 maps/reduces
>
>
>
> Baseline: version 2.2
>
> Result: around 109000s CPU time
>
> (The terasort job is optimized because I just want to measure the cost of
> the framework; anyone redoing the test with the Terasort job from the
> examples would see a surprisingly higher CPU time because of the counter
> system)
>
> Sort takes about 45% of the total CPU time.
>
>
>
> Improvement 0+1
>
> Result: around 78000s CPU time
>
>
>
> Improvement 0+1+2
>
> Result: around 63000s CPU time
>
>
>
> Improvement 0 uses SSE to accelerate the IFile checksum. The current Java
> implementation costs too much CPU, but I believe it deserves a separate
> discussion, so I won't expand on it here. Let me know if it seems worth
> doing.
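> As a rough, Hadoop-independent sense of what the checksum alone costs: on a
> modern JVM java.util.zip.CRC32 is hardware-accelerated, so it approximates
> the ceiling an SSE-based checksum could reach (Hadoop's own DataChecksum /
> PureJavaCrc32 path differs in detail; this is only an assumed proxy):

```java
import java.util.zip.CRC32;

// Micro-measurement of raw CRC32 throughput over a 64 MiB buffer.
// This is a proxy for the hardware-accelerated ceiling, not a measurement
// of Hadoop's IFile checksum path.
public class CrcCost {
    public static void main(String[] args) {
        byte[] buf = new byte[64 * 1024 * 1024];   // 64 MiB, contents irrelevant
        CRC32 crc = new CRC32();
        long t0 = System.nanoTime();
        crc.update(buf, 0, buf.length);
        long t1 = System.nanoTime();
        double secs = (t1 - t0) / 1e9;
        System.out.printf("crc=%08x  %.0f MiB/s%n", crc.getValue(), 64.0 / secs);
    }
}
```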
>
>
>
> Improvement 1 uses a reduce-side sort: the intermediate result contains
> unsorted data, and the reduce receives the data and sorts it before the
> reduce function runs.
>
> The reason it saves CPU is that quick sort is more efficient than
> heap-based merge sort. It saves comparisons, which in some circumstances
> is a direct reflection of the CPU cost.
>
> For a large job with thousands of reduces, each map sends little data to
> each reduce, so the majority of the sort work happens on the reduce side.
> In our case, each map quick-sorts approximately 62500 bytes per reduce,
> but each reduce has to merge 4000 segments.
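> To make the comparison-count argument concrete, a back-of-the-envelope
> sketch. The constants are assumptions, not cluster measurements: ~100-byte
> terasort records, ~1.39*n*lg(n) expected quicksort comparisons, and up to
> 2*lg(k) heap comparisons per element in a k-way merge:

```java
// Estimated comparison counts, per reduce, for the terasort numbers above.
public class CompareCost {
    static double lg(double x) { return Math.log(x) / Math.log(2); }

    public static void main(String[] args) {
        long maps = 4000;
        long perSegment = 62500 / 100;          // ~625 records per map->reduce segment
        long perReduce = maps * perSegment;     // 2.5M records per reduce

        // current path: tiny map-side quicksorts + reduce-side heap merge
        double mapSort = maps * 1.39 * perSegment * lg(perSegment);
        double heapMerge = perReduce * 2.0 * lg(maps);
        // proposed path: ship unsorted, one big reduce-side quicksort
        double bigSort = perReduce * 1.39 * lg(perReduce);

        System.out.printf("current:  %.0fM map sort + %.0fM heap merge = %.0fM%n",
                mapSort / 1e6, heapMerge / 1e6, (mapSort + heapMerge) / 1e6);
        System.out.printf("proposed: %.0fM in one quicksort%n", bigSort / 1e6);
    }
}
```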
>
> There are several drawbacks of using a reduce side sort.
>
> First of all, it requires a rewrite of the whole data path. Although
> writing the code is not much work, a few strategies in the current
> implementation will need to be reconsidered in the new one.
>
> Secondly, the combiner would be tricky to implement. One choice is to use
> an in-memory hash table, which needs precise tracking of the memory usage
> of Java objects. Another choice is to fall back to the current data path
> when a combiner is required. But maintaining two separate data paths seems
> likely to cause much extra work in the future.
>
> I have tried the in-memory hash table. When selectivity is high (the
> number of intermediate keys is much smaller than the number of input
> keys), it is feasible and shows a certain performance gain. But when
> selectivity is low, I don't see any performance advantage. Still, I
> believe there is a way to implement a combiner at lower cost; I just
> don't have much thought on it right now.
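> A minimal sketch of the hash-table variant for a sum-style combiner. The
> class, names, and the per-entry byte constant are mine; the guessed
> constant stands in for exactly the imprecise part, tracking Java object
> sizes:

```java
import java.util.HashMap;
import java.util.Map;

// Hash-based combiner sketch: accumulate partial sums per key, flush when
// an *estimated* byte budget is exceeded. Precise accounting of Java object
// sizes is the hard part this sketch deliberately approximates.
public class HashCombiner {
    private static final long BUDGET = 64 * 1024 * 1024;  // allowed bytes
    private static final long ENTRY_OVERHEAD = 64;        // guessed per-entry cost

    private final Map<String, Long> table = new HashMap<>();
    private long estimatedBytes = 0;

    void collect(String key, long value) {
        Long old = table.put(key, value + table.getOrDefault(key, 0L));
        if (old == null) {  // new key: charge its estimated footprint
            estimatedBytes += ENTRY_OVERHEAD + 2L * key.length();
            if (estimatedBytes > BUDGET) flush();
        }
    }

    void flush() {
        // a real collector would spill sorted (key, partial sum) pairs here
        table.clear();
        estimatedBytes = 0;
    }

    long partialSum(String key) { return table.getOrDefault(key, 0L); }
}
```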
>
> The last drawback is about memory usage. We still need a large buffer, as
> io.sort.mb indicates, on the map side to avoid small spill files, and we
> also need a buffer on the reduce side to sort the data into a large
> segment. I don't think it is a big problem; I mention it only to complete
> the picture.
>
>
>
> Improvement 2 is about the sorting algorithm. I know some issues have
> been filed to improve it, but there is still a lot of room for further
> improvement.
>
> First of all, the current MapOutputBuffer implementation uses a single
> buffer to store both data and index. It is an adaptive strategy for both
> long and short records, but it's not a good choice for performance. An
> implementation that uses a direct int[] for the 16-byte index records cut
> the CPU cost in half in my experiment.
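> A sketch of the separate-index idea: keep the 16-byte index records (four
> ints each) in their own int[] and sort by swapping index entries, never
> moving the record bytes. The field names and layout are mine, only loosely
> mirroring MapOutputBuffer's kvmeta, and the insertion sort is there just
> to show the access pattern:

```java
// Sort index entries (4 ints per record) over an untouched data buffer.
public class IndexSort {
    static final int PARTITION = 0, KEYSTART = 1, KEYLEN = 2, VALLEN = 3, NFIELDS = 4;

    // order records i and j by partition, then raw key bytes
    static int compare(int[] meta, byte[] data, int i, int j) {
        int a = i * NFIELDS, b = j * NFIELDS;
        if (meta[a + PARTITION] != meta[b + PARTITION])
            return Integer.compare(meta[a + PARTITION], meta[b + PARTITION]);
        return compareBytes(data, meta[a + KEYSTART], meta[a + KEYLEN],
                            data, meta[b + KEYSTART], meta[b + KEYLEN]);
    }

    static int compareBytes(byte[] x, int xs, int xl, byte[] y, int ys, int yl) {
        int n = Math.min(xl, yl);
        for (int k = 0; k < n; k++) {
            int c = (x[xs + k] & 0xff) - (y[ys + k] & 0xff);
            if (c != 0) return c;
        }
        return xl - yl;
    }

    static void swap(int[] meta, int i, int j) {
        for (int f = 0; f < NFIELDS; f++) {
            int t = meta[i * NFIELDS + f];
            meta[i * NFIELDS + f] = meta[j * NFIELDS + f];
            meta[j * NFIELDS + f] = t;
        }
    }

    // insertion sort over index entries; only meta[] is ever written
    static void sort(int[] meta, byte[] data, int nRecords) {
        for (int i = 1; i < nRecords; i++)
            for (int j = i; j > 0 && compare(meta, data, j - 1, j) > 0; j--)
                swap(meta, j - 1, j);
    }
}
```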
>
> But even with an optimized QuickSort implementation, it will still be
> much slower than RadixSort. On 250MB of terasort data, RadixSort was
> 6.5x~10x faster in my tests.
>
> One thing that needs clarifying is how to fit RadixSort into current
> Hadoop. The plain answer is that we can't replace QuickSort with
> RadixSort in the general framework, but for the majority of usage we can.
>
> The assumption here is that most key types used in Hadoop are
> IntWritable, Text, or other subclasses of BinaryComparable. As long as we
> can get four bytes from the key that share the same order as the original
> key, we can sort with RadixSort and then use QuickSort within runs of
> equal four-byte prefixes. MAPREDUCE-1639
> <https://issues.apache.org/jira/browse/MAPREDUCE-1639> and MAPREDUCE-3235
> <https://issues.apache.org/jira/browse/MAPREDUCE-3235> have related
> discussion.
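> A sketch of that hybrid, assuming BinaryComparable-style keys compared as
> unsigned bytes: LSD radix sort on a 4-byte order-preserving prefix, then a
> comparison sort inside each run of equal prefixes (prefixOf and the
> structure are mine, not from those JIRAs):

```java
import java.util.Arrays;

public class RadixHybrid {
    // first four key bytes as an unsigned big-endian value; short keys padded with 0
    static long prefixOf(byte[] key) {
        long p = 0;
        for (int i = 0; i < 4; i++)
            p = (p << 8) | (i < key.length ? key[i] & 0xff : 0);
        return p;
    }

    static void sort(byte[][] keys) {
        int n = keys.length;
        long[] pfx = new long[n];
        for (int i = 0; i < n; i++) pfx[i] = prefixOf(keys[i]);

        // stable LSD radix over the four prefix bytes
        byte[][] tmpK = new byte[n][];
        long[] tmpP = new long[n];
        for (int shift = 0; shift < 32; shift += 8) {
            int[] count = new int[257];
            for (int i = 0; i < n; i++) count[(int) ((pfx[i] >>> shift) & 0xff) + 1]++;
            for (int b = 0; b < 256; b++) count[b + 1] += count[b];
            for (int i = 0; i < n; i++) {
                int pos = count[(int) ((pfx[i] >>> shift) & 0xff)]++;
                tmpK[pos] = keys[i];
                tmpP[pos] = pfx[i];
            }
            System.arraycopy(tmpK, 0, keys, 0, n);
            System.arraycopy(tmpP, 0, pfx, 0, n);
        }

        // comparison sort only within runs that share the 4-byte prefix
        int start = 0;
        for (int i = 1; i <= n; i++) {
            if (i == n || pfx[i] != pfx[start]) {
                Arrays.sort(keys, start, i, RadixHybrid::compareBytes);
                start = i;
            }
        }
    }

    static int compareBytes(byte[] x, byte[] y) {
        int m = Math.min(x.length, y.length);
        for (int k = 0; k < m; k++) {
            int c = (x[k] & 0xff) - (y[k] & 0xff);
            if (c != 0) return c;
        }
        return x.length - y.length;
    }
}
```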
>
> Another important thing to point out is that if we don't combine
> improvement 2 with improvement 1, it will be less effective, since the
> majority of the sorting happens on the reduce side using the inefficient
> merge sort.
>
>
>
> The combination of improvements 1 & 2 makes sort an insignificant part
> of the job. In the terasort job, of the 63000s CPU time of our improved
> implementation, only 1500s is spent on the sort. A few other jobs from
> HiBench also show less than 5% on the sort.
>
> It's a big saving for a major rewrite. Please let me know if someone is
> interested in the idea.
>
> Thank you!
> --
>
> Dasheng Jiang
>
> Peking University, Beijing
>
> Email: dsjiang753@gmail.com
>
> Phone: 18810775811
