sqoop-user mailing list archives

From abhijeet gaikwad <abygaikwa...@gmail.com>
Subject Re: Sqoop split-by column limiting map tasks
Date Fri, 31 Aug 2012 16:36:56 GMT
As a workaround for your scenario, add this option to your Sqoop
command:
--boundary-query 'select min(mapid), max(mapid) + 1 from <table_name>'
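
To illustrate why the "+ 1" should help, here is a rough Python sketch of
how an integer range might be cut into splits. It is a simplified model
written for this thread, not Sqoop's actual code: the names split_points
and split_ranges are hypothetical, and the real splitter may behave
differently between versions.

```python
def split_points(lo, hi, num_splits):
    """Boundary values for the integer range [lo, hi] (simplified model)."""
    # Integer step size, never smaller than 1.
    step = max((hi - lo) // num_splits, 1)
    points = list(range(lo, hi + 1, step))
    # Make sure the upper bound is always the final boundary.
    if points[-1] != hi or len(points) == 1:
        points.append(hi)
    return points


def split_ranges(lo, hi, num_splits):
    """Consecutive boundary pairs; each pair stands for one map task."""
    points = split_points(lo, hi, num_splits)
    return list(zip(points[:-1], points[1:]))


# min(mapid) = 0, max(mapid) = 6: seven boundaries give only six ranges.
print(len(split_ranges(0, 6, 7)))

# With max(mapid) + 1 = 7 as the upper bound there are eight boundaries.
print(len(split_ranges(0, 7, 7)))
```

In this model the range 0 to 6 produces six splits, with the last split
covering both 5 and 6, while the range 0 to 7 produces seven splits, one
per mapid value.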

Let me know if that doesn't work.

On 30 Aug 2012 21:43, "Erik Knoll" <erikknoll@gmail.com> wrote:

> I'm using Sqoop 1.4.1 to import a table from MySQL to HDFS. The table
> contains log entries by users who are identified by an integer user ID
> but does not have a primary key. Because of the way user IDs were
> assigned, lower-valued IDs have more records in the table than larger
> IDs, making parallel imports extremely unbalanced (I'm only running 7
> map tasks).
> In order to balance the parallel import, I created an additional
> column which I set to be 'mapid = UserID mod 7', producing values
> 0 - 6 which do have a uniform distribution of records. When I run the
> Sqoop import with '--split-by mapid -m 7', the job seems to be limited
> to 6 map tasks. The same behavior is exhibited even if I add 1 to my
> 'mapid' column, so I'm thinking Sqoop is limiting the map tasks to the
> difference between the minimum and maximum values of the split-by
> column without adding 1 to the range.
> I know that I can create a different 'mapid' column or create a
> primary key, but is there something I can do with Sqoop to correct for
> this?
> Thank you,
> Erik
