From: Robert Evans <evans@yahoo-inc.com>
To: mapreduce-dev@hadoop.apache.org
Date: Thu, 8 Nov 2012 08:59:04 -0800
Subject: Re: Shuffle phase: fine-grained control of data flow

Jiwei,

Ok, so you are specifically looking at reducing the overall network
bandwidth of skewed map outputs, not of all map outputs. That would very
much mean that #1 and #3 are off base. But, as you point out, it would
only really be a performance win if the data fits into memory. It seems
like an interesting idea. If the goal is to reduce bandwidth rather than
to improve individual job performance, then it seems more plausible.

Do you have a benchmark (a GridMix run, etc.) that really taxes the
network that you could use to measure the impact such a change would
have? Something like this really needs some hard numbers for a proper
evaluation.

--Bobby Evans

On 11/7/12 11:32 PM, "Jiwei Li" wrote:

>Hi Bobby,
>
>Thanks a lot for your suggestions. My whole idea is to minimize the
>aggregate network bandwidth during the shuffle phase, that is, to keep
>the number of hops to a minimum when transmitting data from a map node
>to a reduce node. Usually the Partitioner creates skew, so the
>JobTracker ends up assigning very different amounts of map output to
>the participating reduce nodes. Placing reduce tasks near the map
>outputs that hold their largest partitions can reduce the aggregate
>network bandwidth.
>
>For #1, there is no need to schedule map tasks to be close to one
>another, since that would only congest links across the cluster. For
>#2, the location and size of each partition in each map output can be
>sent to the JobTracker as the InputSplits are processed. Once enough of
>this information has been collected (not necessarily waiting for all
>map tasks to finish), the JobTracker can start scheduling reduce tasks
>to fetch the map output data. #3 is the same as #1.
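>
>To make #2 concrete, the placement rule I have in mind is roughly the
>greedy heuristic sketched below. This is an illustration only, in plain
>Java: the class and variable names are made up and nothing here touches
>Hadoop internals. For each reduce partition, pick the node that already
>holds the largest share of that partition's map output, so that those
>bytes never have to cross the network.
>
>// Illustration only -- not Hadoop code; all names are hypothetical.
>import java.util.HashMap;
>import java.util.Map;
>
>public class GreedyReducePlacement {
>
>  /**
>   * bytesPerNode maps each node to the bytes of every reduce partition
>   * held by its local map outputs (array index = partition id).
>   * Returns the node chosen for each reduce partition.
>   */
>  public static String[] place(Map<String, long[]> bytesPerNode,
>                               int numPartitions) {
>    String[] assignment = new String[numPartitions];
>    for (int p = 0; p < numPartitions; p++) {
>      long best = -1;
>      for (Map.Entry<String, long[]> e : bytesPerNode.entrySet()) {
>        if (e.getValue()[p] > best) {       // largest local share wins
>          best = e.getValue()[p];
>          assignment[p] = e.getKey();
>        }
>      }
>    }
>    return assignment;
>  }
>
>  public static void main(String[] args) {
>    Map<String, long[]> bytes = new HashMap<String, long[]>();
>    bytes.put("nodeA", new long[] {900L, 10L}); // local bytes, parts 0,1
>    bytes.put("nodeB", new long[] {50L, 700L});
>    String[] a = place(bytes, 2);
>    System.out.println("partition 0 -> " + a[0]); // nodeA
>    System.out.println("partition 1 -> " + a[1]); // nodeB
>  }
>}
>
>A real scheduler would also have to respect slot availability and rack
>topology, but since the skewed partitions dominate the traffic, a greedy
>choice like this should capture most of the savings.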
>
>Now the tricky part is that if all of the map outputs are spilled to
>disk, network bandwidth may not be the bottleneck, because the time
>spent on disk seeks outweighs the time spent transmitting the data. If
>the map outputs fit in memory, then the network must be taken
>seriously. Also note that for evenly distributed map outputs the
>current scheduling policy works just fine.
>
>Jiwei
>
>
>On Wed, Nov 7, 2012 at 11:45 PM, Robert Evans wrote:
>
>> Jiwei,
>>
>> I think you could use that knowledge to launch reducers closer to the
>> map output, but I am not sure that it would make much difference. It
>> may even slow things down. It is a question of several things:
>>
>> 1) Can we get enough map tasks close to one another that it will make
>> a difference?
>> 2) Does the reduced shuffle time offset the overhead of waiting for
>> the map location data before launching and fetching data early?
>> 3) Do the time savings also offset the overhead of getting the map
>> tasks to be close to one another?
>>
>> For #2 you might be able to deal with this by using speculative
>> execution, and launching some reduce tasks later if you see a
>> clustering of map output. For #1 it will require changes to how we
>> schedule tasks, which, depending on how well it is implemented, will
>> impact #3 as well. Additionally, for #1 any job that approaches the
>> same order of size as the cluster will almost require the map tasks
>> to be evenly distributed around the cluster. If you can come up with
>> a patch I would love to see some performance numbers.
>>
>> Personally I think spending time reducing the size of the data sent
>> to the reducers is a much bigger win. Can you use a combiner? Do you
>> really need all of the data, or can you sample it to get a
>> statistically significant picture of what is in it? Have you enabled
>> compression between the maps and the reducers?
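>>
>> For example, something along these lines (an untested sketch against
>> the old org.apache.hadoop.mapred API; the token-counting classes from
>> mapred.lib just stand in for whatever your job actually does):
>>
>> // Untested sketch: enable a combiner plus map-output compression.
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.io.compress.DefaultCodec;
>> import org.apache.hadoop.mapred.FileInputFormat;
>> import org.apache.hadoop.mapred.FileOutputFormat;
>> import org.apache.hadoop.mapred.JobClient;
>> import org.apache.hadoop.mapred.JobConf;
>> import org.apache.hadoop.mapred.lib.LongSumReducer;
>> import org.apache.hadoop.mapred.lib.TokenCountMapper;
>>
>> public class ShuffleShrinkExample {
>>   public static void main(String[] args) throws Exception {
>>     JobConf conf = new JobConf(ShuffleShrinkExample.class);
>>     conf.setJobName("shuffle-shrink-example");
>>
>>     conf.setOutputKeyClass(Text.class);
>>     conf.setOutputValueClass(LongWritable.class);
>>     conf.setMapperClass(TokenCountMapper.class);
>>     conf.setReducerClass(LongSumReducer.class);
>>
>>     // Summing is associative and commutative, so the reducer can also
>>     // run as a combiner on the map side -- far fewer records are
>>     // shuffled across the network.
>>     conf.setCombinerClass(LongSumReducer.class);
>>
>>     // Compress the intermediate map output before it leaves the node.
>>     conf.setCompressMapOutput(true);
>>     conf.setMapOutputCompressorClass(DefaultCodec.class);
>>
>>     FileInputFormat.setInputPaths(conf, new Path(args[0]));
>>     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>>     JobClient.runJob(conf);
>>   }
>> }
>>
>> If your reduce logic is not associative and commutative you can often
>> still write a separate combiner class that does a partial aggregation,
>> and a faster codec (Snappy or LZO, where the native libraries are
>> installed) usually beats the default zlib codec for intermediate data.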
>>
>> --Bobby
>>
>> On 11/7/12 8:05 AM, "Harsh J" wrote:
>>
>> >Hi Jiwei,
>> >
>> >In trunk (i.e. MR2), the completion-event selection and scheduling
>> >logic lives in the EventFetcher class's getMapCompletionEvents()
>> >method, viewable at
>> >
>> >http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/EventFetcher.java?view=markup
>> >
>> >This EventFetcher thread is used by the Shuffle class (in the reduce
>> >package) to continually drive the shuffling. The Shuffle class is in
>> >turn used by the ReduceTask class (look in the mapred package of the
>> >same Maven module).
>> >
>> >I guess you can start there, to see if a better selection and
>> >scheduling logic would yield better results.
>> >
>> >On Wed, Nov 7, 2012 at 12:26 PM, Jiwei Li wrote:
>> >> Dear all,
>> >>
>> >> For jobs like Sort, massive amounts of network traffic occur during
>> >> the shuffle phase. The simple mechanism that Hadoop 1.0.4 uses to
>> >> choose reduce nodes does not help reduce this traffic. If the
>> >> JobTracker is fully aware of the location of every map output, why
>> >> not take advantage of this topology knowledge?
>> >>
>> >> So, does anyone know where to start developing such code? Many
>> >> thanks.
>> >>
>> >> Regards.
>> >> --
>> >> Jiwei
>> >
>> >
>> >--
>> >Harsh J
>>
>>
>
>
>--
>Jiwei Li