hadoop-mapreduce-dev mailing list archives

From sudhakara st <sudhakara...@gmail.com>
Subject Re: configuring number of mappers and reducers
Date Tue, 09 Apr 2013 15:43:33 GMT
Hi Samaneh,

Increasing the number of reducers for a job will not help as much as you
are expecting. In most MR jobs, more than 60% of the time is spent in the
map phase (it depends on what type of operation is performed on the data
in the map and reduce phases).

Increasing the number of reduces increases the framework overhead, but it
also improves load balancing and lowers the cost of failures. By weighing
the job's processing requirements against the available map-reduce slots
and system resources, we can optimize a job for the best performance.

One more thing I cannot understand is why you are so worried about response
time. The response time depends purely on how much data you are processing
in the job, what type of operation is performed on the data, how the data
is distributed in the cluster, and the capacity of your cluster. An MR job
can be called optimized when it contains a balanced number of mappers and
reducers. For typical MR applications like word count, I suggest a
mapper-to-reducer ratio of 4:1 if your job runs without a combiner; in a
word-count-like program with a combiner defined, I would suggest 10:1. A
rough sketch of setting the reducer count accordingly follows.

While tuning MR jobs we cannot consider response time the only parameter
to optimize; there are many other factors to consider. The response time
does not depend only on the number of reducers configured for the job, but
on the numerous other factors mentioned above.



On Tue, Apr 9, 2013 at 2:05 PM, Samaneh Shokuhi
<samaneh.shokuhi@gmail.com> wrote:

> Thanks Sudhakara for your reply.
> I did my experiments by varying the number of reducers, doubling it in
> each experiment. I have a question regarding the response time. Suppose
> there are 6 cluster nodes; in the first experiment I have 3 reducers,
> which doubles to 6 in the second experiment and to 12 in the third. What
> do we expect to see in the response time? Should it change approximately
> like T, T/2, T/4, ...?!
> What I get as response time does not change like that; the decrease is
> more like 2% or 3%. So I want to know, by increasing the number of
> reducers, how much decrease in response time should we normally get?
>
> Samaneh
>
>
> On Sun, Apr 7, 2013 at 7:53 PM, sudhakara st <sudhakara.st@gmail.com>
> wrote:
>
> > Hi Samaneh,
> >
> > You can experiment with:
> >
> > 1. Varying the number of reducers (mapred.reduce.tasks).
> >
> > Configure these parameters depending on your system capacity:
> > mapred.tasktracker.map.tasks.maximum
> > mapred.tasktracker.reduce.tasks.maximum
> >
> > Tasktrackers have a fixed number of slots for map tasks and for reduce
> > tasks. The precise number depends on the number of cores and the amount
> > of memory on the tasktracker nodes; for example, a quad-core node with
> > 8 GB of memory may be able to run 3 map tasks and 2 reduce tasks
> > simultaneously (not precise, it depends on what type of job you are
> > running). A sketch of the per-job vs. per-node settings follows.
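> >
> > A minimal sketch of point 1 (old mapred API; the value 6 is only a
> > placeholder for your cluster):
> >
> >     import org.apache.hadoop.mapred.JobConf;
> >
> >     public class ReducerCountSketch {
> >       public static void main(String[] args) {
> >         JobConf conf = new JobConf();
> >         conf.setNumReduceTasks(6);             // per-job, honored exactly
> >         conf.setInt("mapred.reduce.tasks", 6); // equivalent property form
> >
> >         // mapred.tasktracker.map.tasks.maximum and
> >         // mapred.tasktracker.reduce.tasks.maximum are read by the
> >         // tasktracker daemon at startup, so they belong in
> >         // mapred-site.xml on each node; setting them per job has
> >         // no effect.
> >       }
> >     }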
> >
> >
> > The right number of reduces seems to be 0.95 or 1.75 * (nodes *
> > mapred.tasktracker.reduce.tasks.maximum). At 0.95, all of the reduces
> > can launch immediately and start transferring map outputs as the maps
> > finish. At 1.75, the faster nodes will finish their first round of
> > reduces and launch a second round of reduces, doing a much better job
> > of load balancing. A worked example follows.
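> >
> > As a worked example (a sketch assuming your 6 datanodes and a
> > placeholder of 2 reduce slots per node):
> >
> >     public class ReducerWaves {
> >       public static void main(String[] args) {
> >         // reduce slots = nodes * mapred.tasktracker.reduce.tasks.maximum
> >         int reduceSlots = 6 * 2;                   // 12 slots in total
> >         int oneWave  = (int) (0.95 * reduceSlots); // 11: all launch at once
> >         int twoWaves = (int) (1.75 * reduceSlots); // 21: two waves,
> >                                                    // better load balancing
> >         System.out.println(oneWave + " or " + twoWaves + " reducers");
> >       }
> >     }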
> >
> > 2. These are some of the main job tuning factors in terms of cluster
> > resource utilization (CPU, memory, I/O, network) and response time; a
> > combined sketch follows the list.
> >
> >    A) Sort and shuffle buffers
> >          io.sort.mb
> >          io.sort.record.percent
> >          io.sort.spill.percent
> >          io.sort.factor
> >          mapred.reduce.parallel.copies
> >
> >    B) Compression of mapper and reducer outputs
> >          mapred.map.output.compression.codec
> >
> >    C) Enabling/disabling speculative task execution
> >          mapred.map.tasks.speculative.execution
> >          mapred.reduce.tasks.speculative.execution
> >
> >    D) Enabling JVM reuse
> >          mapred.job.reuse.jvm.num.tasks
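> >
> > A combined sketch of A-D (old JobConf API; every value below is a
> > placeholder to experiment with, not a recommendation):
> >
> >     import org.apache.hadoop.io.compress.CompressionCodec;
> >     import org.apache.hadoop.io.compress.GzipCodec;
> >     import org.apache.hadoop.mapred.JobConf;
> >
> >     public class TuningSketch {
> >       public static void main(String[] args) {
> >         JobConf conf = new JobConf();
> >
> >         // A) Sort/shuffle buffers
> >         conf.setInt("io.sort.mb", 200);
> >         conf.setFloat("io.sort.record.percent", 0.05f);
> >         conf.setFloat("io.sort.spill.percent", 0.80f);
> >         conf.setInt("io.sort.factor", 50);
> >         conf.setInt("mapred.reduce.parallel.copies", 10);
> >
> >         // B) Compress the intermediate map output
> >         conf.setBoolean("mapred.compress.map.output", true);
> >         conf.setClass("mapred.map.output.compression.codec",
> >                       GzipCodec.class, CompressionCodec.class);
> >
> >         // C) Speculative execution on/off per phase
> >         conf.setBoolean("mapred.map.tasks.speculative.execution", true);
> >         conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
> >
> >         // D) JVM reuse: -1 means no limit on tasks per JVM
> >         conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
> >       }
> >     }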
> >
> >
> > On Sun, Apr 7, 2013 at 10:31 PM, Samaneh Shokuhi
> > <samaneh.shokuhi@gmail.com> wrote:
> >
> > > Thanks Sudhakara for your reply.
> > > So if the number of mappers depends on the data size, maybe the best
> > > way to do my experiments is to increase the number of reducers based
> > > on the number of estimated blocks in the data file. Actually, I want
> > > to know how the response time changes as the number of mappers and
> > > reducers changes.
> > > Any idea about the way of doing this kind of experiment?
> > >
> > > Samaneh
> > >
> > >
> > > On Sun, Apr 7, 2013 at 6:29 PM, sudhakara st <sudhakara.st@gmail.com>
> > > wrote:
> > >
> > > > Hi Samaneh,
> > > >
> > > > The number of map tasks for a given job is driven by the number of
> > > > input splits in the input data. Ideally, in the default
> > > > configuration, one map task is spawned for each input split (one
> > > > split per block). So your 2.5G of data contains 44 blocks, and
> > > > therefore your job takes 44 map tasks. At a minimum, with
> > > > FileInputFormat derivatives, a job will have at least one map per
> > > > file, and can have multiple maps per file if a file extends beyond a
> > > > single block (file size is more than the block size). The
> > > > *mapred.map.tasks* parameter is just a hint to the InputFormat for
> > > > the number of maps; it has no effect if the number of blocks in the
> > > > input data is more than the specified value. It is not possible to
> > > > dictate the number of mappers that run for a job, but it is possible
> > > > to explicitly specify the number of reducers for a job by using the
> > > > *mapred.reduce.tasks* property, as in the sketch below.
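> > > >
> > > > A small sketch of that arithmetic (assuming a 64 MB dfs.block.size
> > > > for illustration; the exact count depends on the configured block
> > > > size and file boundaries):
> > > >
> > > >     import org.apache.hadoop.mapred.JobConf;
> > > >
> > > >     public class MapCountEstimate {
> > > >       public static void main(String[] args) {
> > > >         // Default behaviour: roughly one map task per HDFS block.
> > > >         long fileSize  = 2500L * 1024 * 1024; // ~2.5 GB of input
> > > >         long blockSize = 64L * 1024 * 1024;   // dfs.block.size (assumed)
> > > >         long numMaps = (fileSize + blockSize - 1) / blockSize; // = 40
> > > >         // 44 observed maps would mean a smaller block size or several
> > > >         // files; mapred.map.tasks cannot lower this below the split
> > > >         // count.
> > > >         System.out.println(numMaps + " map tasks expected");
> > > >
> > > >         // The reduce count, by contrast, is honored exactly:
> > > >         JobConf conf = new JobConf();
> > > >         conf.setNumReduceTasks(3);
> > > >       }
> > > >     }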
> > > >
> > > > The replication factor is not related in any way to the number of
> > > > mappers and reducers.
> > > >
> > > > On Sun, Apr 7, 2013 at 7:38 PM, Samaneh Shokuhi
> > > > <samaneh.shokuhi@gmail.com> wrote:
> > > >
> > > > > Hi All,
> > > > > I am doing some experiments by running the WordCount example on
> > > > > Hadoop. I have a cluster with 7 nodes. I want to run the WordCount
> > > > > example with 3 mappers and 3 reducers and compare the response
> > > > > time with other experiments where the number of mappers and
> > > > > reducers is increased to 6, 12, and so on.
> > > > > For the first experiment I set the number of mappers and reducers
> > > > > to 3 in the WordCount example source code, and also set the number
> > > > > of replications to 3 in the Hadoop configuration. Also, the
> > > > > maximum number of tasks per node is set to 1.
> > > > > But when I run the sample with big data like 2.5 G, I can see 44
> > > > > map tasks and 3 reduce tasks running!!
> > > > >
> > > > > What parameters do I need to set to get (3 mappers, 3 reducers),
> > > > > (6M, 6R) and (12M, 12R)? As I mentioned, I have a cluster with 1
> > > > > namenode and 6 datanodes.
> > > > > Is the number of replications related to the number of mappers and
> > > > > reducers?!
> > > > > Regards,
> > > > > Samaneh
> > > > >



-- 

Regards,
.....  Sudhakara.st
