sqoop-user mailing list archives

From Brett Medalen <bmeda...@hotmail.com>
Subject Re: Avoiding skew and determining optimal number of mappers in SQOOP import.
Date Sun, 21 Jun 2015 10:59:52 GMT
The right number of mappers depends on a few factors (I am assuming a Sqoop import in this response):

1) The number of data nodes - Sqoop will take your -m switch and launch that many map tasks,
shipping the generated .jar file to the nodes that run them. So if you have 5 datanodes and
you set your Sqoop execution to -m 30 then you might overrun your Hadoop cluster.

2) How many parallel SQL queries your source RDBMS can handle - Each mapper opens its own
connection and runs its own query, so a switch of -m 30 may completely paralyze a source RDBMS
because of the concurrent load you are requesting.

3) Skew of data by PK - Sqoop will take the MIN and MAX of the --split-by column (unless you
do something special with the --boundary-query switch) and divide that range into as many
equal-width slices as there are mappers (the -m switch), one slice per mapper. If the values
are not spread evenly across that range, some mappers get far more rows than others. So for
example -m 4 may skew the data badly, while even a small change to -m 5 or -m 6 may have the
skew looking much better. You can test the skew by running count queries over the same ranges
against the source RDBMS, or by looking at the sizes of the data files Sqoop creates.
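
To make point 3 concrete, here is a minimal Python sketch of the equal-width splitting
described above, so you can estimate skew for a candidate mapper count before running the
import. It is only an approximation for a numeric split column: Sqoop's real splitters vary
by column type, and the function names and the sample ids below are made up for illustration.

```python
def split_ranges(lo, hi, num_mappers):
    """Cut [lo, hi] into num_mappers equal-width intervals.

    Approximates how a numeric --split-by column is partitioned; this is
    an illustrative helper, not Sqoop's own code.
    """
    width = (hi - lo) / num_mappers
    bounds = [lo + i * width for i in range(num_mappers)] + [hi]
    return list(zip(bounds[:-1], bounds[1:]))


def rows_per_split(values, num_mappers):
    """Count how many values fall into each equal-width interval."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # Every row has the same key, so one mapper gets everything.
        return [len(values)]
    width = (hi - lo) / num_mappers
    counts = [0] * num_mappers
    for v in values:
        # The maximum value belongs to the last interval.
        idx = min(int((v - lo) / width), num_mappers - 1)
        counts[idx] += 1
    return counts


def skew_ratio(counts):
    """Largest split divided by the mean split size; 1.0 is perfectly even."""
    return max(counts) / (sum(counts) / len(counts))


# Hypothetical key distribution: dense ids plus a few large outliers.
sample_ids = list(range(1, 1001)) + [100000, 100001, 100002]
for m in (4, 5, 6):
    counts = rows_per_split(sample_ids, m)
    print(m, counts, round(skew_ratio(counts), 2))
```

In practice you would feed rows_per_split a sample of the split column pulled from the source
table. If every mapper count you try still shows a high ratio, the column itself is a poor
split key, and a different --split-by column or a custom --boundary-query is worth trying.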

I hope that helps. 


> On Jun 21, 2015, at 5:39 AM, sreejesh s <sreejesh356@yahoo.com> wrote:
> Hi,
> If there is a primary key on the source table, SQOOP import would generate no skewed
> data... What if there is no primary key defined on the table and we have to use --split-by
> parameter to split records among multiple mappers.
> There are high chances of skewed data depending on the column we select to --split-by.
> Could you please help me understand how to avoid skewing in such scenarios and also how
> to determine the optimal number of mappers to be used for any SQOOP import.
> It helps if you can explain how many mappers you have used in your use case along with
> the size and format of data imported..
> Thanks
