sqoop-user mailing list archives

From David Kincaid <kincaid.d...@gmail.com>
Subject Re: Strange distribution of keys among mappers
Date Wed, 19 Jun 2013 20:23:56 GMT
Thanks. We didn't specify the number of mappers, so it's giving us 4. I
understand your explanation, but it seems to conflict with the Sqoop user
guide (
http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_parallelism
):

"When performing parallel imports, Sqoop needs a criterion by which it can
split the workload. Sqoop uses a *splitting column* to split the workload.
By default, Sqoop will identify the primary key column (if present) in a
table and use it as the splitting column. The low and high values for the
splitting column are retrieved from the database, and the map tasks operate
on evenly-sized components of the total range. For example, if you had a
table with a primary key column of id whose minimum value was 0 and maximum
value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four
processes which each execute SQL statements of the form SELECT * FROM
sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250,
500), (500, 750), and (750, 1001) in the different tasks."
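
As a rough illustration of what that paragraph describes, here is a minimal Python sketch. It is only my reading of the user guide text, not Sqoop's actual code, and the function name `guide_ranges` is made up for this example.

```python
def guide_ranges(min_val, max_val, num_tasks):
    """Half-open [lo, hi) ranges as described in the user guide paragraph.

    Assumption: the range is cut into num_tasks equal pieces (integer
    division) and the final upper bound is max_val + 1 so that the
    maximum key is included, as in the guide's (750, 1001) example.
    """
    size = (max_val - min_val) // num_tasks
    points = [min_val + size * i for i in range(num_tasks)]
    points.append(max_val + 1)
    return list(zip(points[:-1], points[1:]))


# The guide's own example: min 0, max 1000, 4 tasks.
print(guide_ranges(0, 1000, 4))
# -> [(0, 250), (250, 500), (500, 750), (750, 1001)]

# The key range from this thread, with the 4 mappers we actually got.
for lo, hi in guide_ranges(2110, 288272191, 4):
    print(f"[{lo}, {hi})")
```

Under this reading, four mappers would each get roughly a quarter of the key range, which is not what we are seeing.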


On Wed, Jun 19, 2013 at 3:14 PM, Abraham Elmahrek <abe@cloudera.com> wrote:

> Hey David,
>
> Here's the algorithm:
> Split lengths are defined by (max - min)/(# mappers), and whatever is left
> is tacked on at the end. In this case, (288272191-2110)/3 = 96090027
> exactly. I'm assuming any fractional part would be rounded down, so split
> lengths will be 96090027. Sqoop will then create splits with the
> following points: (min) + (split length)*(n). We can see that
> 2110 + 96090027*0 = 2110, 2110 + 96090027*1 = 96092137,
> 2110 + 96090027*2 = 192182164, and 2110 + 96090027*3 = 288272191
> will be generated by this algorithm.
> The last point to be added will be 288272192 because the max value is not
> part of the generated split points. Then Sqoop will distribute the keys
> based on these points, as you've pointed out above.
>
> Just to be sure, did you configure sqoop to use 3 mappers?
>
> Hope this helps,
> -Abe
>
>
> On Wed, Jun 19, 2013 at 8:33 AM, David Kincaid <kincaid.dave@gmail.com> wrote:
>
>> We're seeing a strange thing happen with a sqoop import job with the way
>> the key range is getting distributed among the 4 mappers that are running.
>> The minimum key value is 2110 and the maximum value is 288272191. We are
>> getting one mapper that is only getting one record to import. Here is the
>> distribution among the mappers:
>>
>> [2110, 96092137)
>> [96092137, 192182164)
>> [192182164, 288272191)
>> [288272191, 288272192)
>>
>> You can see that the fourth mapper is given a range with only one value
>> in it. Could someone help me understand what is going on?
>>
>> Thanks,
>>
>> Dave
>>
>
>
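
For comparison, the split-point arithmetic described in the quoted reply can be sketched in Python as well. This follows the reply's description only and is not taken from Sqoop's source; the names `reply_split_points` and `to_ranges` are hypothetical. It assumes integer division for the split length, points at min + length*n for n = 0..divisor, and one extra point at max + 1.

```python
def reply_split_points(min_val, max_val, divisor):
    """Split points as described in the quoted reply (an assumption,
    not Sqoop's implementation)."""
    length = (max_val - min_val) // divisor  # fractional part dropped
    points = [min_val + length * n for n in range(divisor + 1)]
    points.append(max_val + 1)  # extra end point so max itself is covered
    return points


def to_ranges(points):
    """Pair consecutive points into half-open [lo, hi) ranges."""
    return list(zip(points[:-1], points[1:]))


# Dividing by 3, as in the reply's arithmetic.
for lo, hi in to_ranges(reply_split_points(2110, 288272191, 3)):
    print(f"[{lo}, {hi})")
```

With a divisor of 3 this prints the same four ranges listed in the original message, including the one-value range [288272191, 288272192), even though the job ran with 4 mappers.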
