So the number of mappers depends on a few factors (assuming a Sqoop import):

1) The number of data nodes - Sqoop will take your -m switch and launch that many map tasks, with the generated .jar file shipped to the data nodes that run them. So if you have 5 data nodes and you set your Sqoop execution to -m 30, you might overrun your Hadoop cluster.

2) How many parallel SQL queries your source RDBMS can handle - Again, sending a switch of -m 30 may completely paralyze a source RDBMS because of the concurrent load you are requesting.

3) Skew of data by PK - Sqoop will take the MIN and MAX of the --split-by column (unless you do something special with the --boundary-query switch) and divide that range evenly by the # of mappers (based on the -m switch). So, for example, -m 4 may skew the data badly, while even a small change to -m 5 or -m 6 may leave the skew looking much better. You can test the skew by running similar range counts against the source RDBMS or by looking at the sizes of the data files Sqoop creates.
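
To make point 3 concrete, here is a rough Python sketch of the idea, not Sqoop's actual splitter code: it cuts the MIN..MAX range of an integer split column into equal-width ranges, counts how many sample keys land in each, and builds the matching COUNT queries you could run against the source RDBMS. The table name "orders", the column "id", and the sample keys are made up for illustration, and the real splitter may round its boundaries a little differently.

```python
from bisect import bisect_right


def split_ranges(lo, hi, num_mappers):
    """Cut the closed range [lo, hi] into contiguous, near-equal-width ranges.

    Simplified model of range splitting; boundaries in real Sqoop may differ.
    """
    span = hi - lo + 1
    # Never create more ranges than there are distinct integer values.
    n = max(1, min(num_mappers, span))
    edges = [lo + (span * i) // n for i in range(n + 1)]
    # Each range is inclusive on both ends: (start, end).
    return [(edges[i], edges[i + 1] - 1) for i in range(n)]


def rows_per_split(keys, ranges):
    """Count how many key values fall into each range."""
    starts = [start for start, _ in ranges]
    counts = [0] * len(ranges)
    for k in keys:
        idx = bisect_right(starts, k) - 1
        # Ignore keys that fall outside the overall [lo, hi] range.
        if 0 <= idx < len(ranges) and k <= ranges[idx][1]:
            counts[idx] += 1
    return counts


def count_queries(table, column, ranges):
    """Build one COUNT query per range to check skew on the source database."""
    return [
        f"SELECT COUNT(*) FROM {table} WHERE {column} >= {s} AND {column} <= {e}"
        for s, e in ranges
    ]


if __name__ == "__main__":
    # Hypothetical key distribution: dense at the low end, one outlier at the top.
    sample_keys = list(range(1, 61)) + [100]
    for mappers in (4, 5, 6):
        ranges = split_ranges(min(sample_keys), max(sample_keys), mappers)
        print(mappers, rows_per_split(sample_keys, ranges))
    # "orders" and "id" are placeholder names, not from any real schema.
    for query in count_queries("orders", "id", split_ranges(1, 100, 4)):
        print(query)
```

Comparing the per-range counts for a few candidate -m values gives a quick feel for which setting spreads the rows most evenly before you run the real import.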

I hope that helps. 


On Jun 21, 2015, at 5:39 AM, sreejesh s wrote:


If there is a primary key on the source table, a Sqoop import would not generate skewed data. But what if there is no primary key defined on the table and we have to use the --split-by parameter to split records among multiple mappers?
There is a high chance of skewed data depending on the column we select for --split-by.
Could you please help me understand how to avoid skew in such scenarios, and also how to determine the optimal number of mappers for any Sqoop import?

It would help if you could explain how many mappers you used in your use case, along with the size and format of the data imported.