hadoop-mapreduce-dev mailing list archives

From Sandeep Paul <paultechn...@gmail.com>
Subject RE: HADOOP-1 Regarding dfs.block.size vs mapred.max.split.size
Date Thu, 11 Sep 2014 11:53:31 GMT
Thank you for the clarification.

-----Original Message-----
From: "Vinayakumar B" <vinayakumarb@apache.org>
Sent: ‎9/‎11/‎2014 4:49 PM
To: "mapreduce-dev@hadoop.apache.org" <mapreduce-dev@hadoop.apache.org>
Subject: Re: HADOOP-1 Regarding dfs.block.size vs mapred.max.split.size

Hi Sandeep,


1. "dfs.block.size" and "mapred.max.split.size" are logically related: tuning
them together gives the best performance when reading big files, by keeping
each map's data local.

2. There is no strict rule in the framework for the max split size. You can
specify a value larger than the block size.

3. If the split size is larger than the block size, then a single map needs to
read multiple blocks. These blocks might be on other nodes, which will
increase the I/O duration.

4. As I said before, you will lose the data-locality gain when a map reads
multiple blocks located on different nodes.
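For reference, the new-API FileInputFormat picks the split size with the rule
splitSize = max(minSplitSize, min(maxSplitSize, blockSize)). Below is a minimal
standalone sketch of that rule (not the actual Hadoop source; class and
variable names are illustrative), showing what happens when the max split size
is set above or below the block size:

```java
// Sketch of the split-size rule used by the new-API FileInputFormat:
//   splitSize = max(minSize, min(maxSize, blockSize))
// Class name and example values are illustrative, not from Hadoop itself.
public class SplitSizeDemo {

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long block = 128L << 20; // 128 MB block

        // max split smaller than the block: splits are capped below the block
        System.out.println(computeSplitSize(block, 1L, 64L << 20));

        // max split larger than the block: split size stays at the block size,
        // so setting mapred.max.split.size > dfs.block.size has no effect here
        System.out.println(computeSplitSize(block, 1L, 256L << 20));

        // min split larger than the block: one map reads multiple blocks
        System.out.println(computeSplitSize(block, 256L << 20, Long.MAX_VALUE));
    }
}
```

Note that with this rule, raising only the max split size above the block size
does not enlarge splits; it is the *min* split size that forces a map to span
multiple blocks (with the locality cost described in points 3 and 4).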


On Thu, Sep 11, 2014 at 2:45 PM, sandeep paul <paultechneer@gmail.com> wrote:

> Hi,
> I need confirmation regarding these two parameters and how they affect
> performance.
> I have read that *mapred.max.split.size* should always be less than
> *dfs.block.size*,
> but we always have the option of specifying *mapred.max.split.size* greater
> than *dfs.block.size*.
> What will happen in that case? Will FileInputFormat allow it when
> calculating splits, or will it take *dfs.block.size* as the split size?
> If the framework allows it, then one map task will end up processing more
> than one block (which will not always be on the local machine). What is the
> performance impact in that case?
> It would be a great help if anyone could help me clear up this confusion.
> Thanks,
> sandeep
