sqoop-user mailing list archives

From David Kincaid <kincaid.d...@gmail.com>
Subject Re: Strange distribution of keys among mappers
Date Thu, 20 Jun 2013 00:03:41 GMT
We don't have that set on our cluster and aren't specifying it in our job.
When I look at our different sqoop jobs, I see 3 map tasks for some and 4
for others.


On Wed, Jun 19, 2013 at 6:50 PM, Abraham Elmahrek <abe@cloudera.com> wrote:

> David,
>
> Well I think sqoop is looking at "mapred.map.tasks". Do you have that set
> in mapred-site.xml? I thought that defaults to 2.
>
> -Abe
>
>
> On Wed, Jun 19, 2013 at 4:31 PM, Abraham Elmahrek <abe@cloudera.com> wrote:
>
>> David,
>>
>> I've created https://issues.apache.org/jira/browse/SQOOP-1093 to track
>> the documentation issue. Thanks for bringing this to the community's
>> attention!
>>
>> -Abe
>>
>>
>> On Wed, Jun 19, 2013 at 4:21 PM, Abraham Elmahrek <abe@cloudera.com> wrote:
>>
>>> Hey David,
>>>
>>> With oracle, the BigDecimalSplitter will be used by default for all
>>> number types.
>>>
>>> -Abe
>>>
>>>
>>> On Wed, Jun 19, 2013 at 4:05 PM, David Kincaid <kincaid.dave@gmail.com> wrote:
>>>
>>>> Abe, the database is Oracle.
>>>>
>>>>
>>>> On Wed, Jun 19, 2013 at 5:48 PM, Abraham Elmahrek <abe@cloudera.com> wrote:
>>>>
>>>>> David,
>>>>>
>>>>> What database are you importing from? The description I gave was for
>>>>> datatypes that map to the BigDecimalSplitter. The user guide might be
>>>>> referring to the IntegerSplitter, which will add the remainder to the
>>>>> last value.
>>>>>
>>>>> -Abe
>>>>>
>>>>>
>>>>> On Wed, Jun 19, 2013 at 1:23 PM, David Kincaid <kincaid.dave@gmail.com> wrote:
>>>>>
>>>>>> Thanks. We didn't specify the number of mappers, so it's giving us 4.
>>>>>> I understand your explanation, but it seems to conflict with the Sqoop
>>>>>> user guide (
>>>>>> http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_parallelism
>>>>>> ):
>>>>>>
>>>>>> "When performing parallel imports, Sqoop needs a criterion by which
>>>>>> it can split the workload. Sqoop uses a *splitting column* to split
>>>>>> the workload. By default, Sqoop will identify the primary key column
>>>>>> (if present) in a table and use it as the splitting column. The low and
>>>>>> high values for the splitting column are retrieved from the database,
>>>>>> and the map tasks operate on evenly-sized components of the total range.
>>>>>> For example, if you had a table with a primary key column of id whose
>>>>>> minimum value was 0 and maximum value was 1000, and Sqoop was directed
>>>>>> to use 4 tasks, Sqoop would run four processes which each execute SQL
>>>>>> statements of the form SELECT * FROM sometable WHERE id >= lo AND id
>>>>>> < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and
>>>>>> (750, 1001) in the different tasks."
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 19, 2013 at 3:14 PM, Abraham Elmahrek <abe@cloudera.com> wrote:
>>>>>>
>>>>>>> Hey David,
>>>>>>>
>>>>>>> Here's the algorithm:
>>>>>>> Split lengths are defined by (max - min)/(# mappers) and whatever is
>>>>>>> left is tacked on at the end. So in this case, (288272191-2110)/3 =
>>>>>>> 96090027, which happens to divide evenly; I'm assuming any fractional
>>>>>>> part would be rounded down. Split lengths will be of length 96090027.
>>>>>>> Sqoop will then create splits with the following points:
>>>>>>> (min) + (range length)*(n). We can see that 2110 + 96090027*0 = 2110,
>>>>>>> 2110 + 96090027*1 = 96092137, 2110 + 96090027*2 = 192182164, and
>>>>>>> 2110 + 96090027*3 = 288272191 will be generated based off of this
>>>>>>> algorithm. The last point to be added will be 288272192 because the max
>>>>>>> value is not part of the generated split points. Then sqoop will
>>>>>>> distribute the keys accordingly based off of these points as you've
>>>>>>> pointed out above.
>>>>>>>
>>>>>>> Just to be sure, did you configure sqoop to use 3 mappers?
>>>>>>>
>>>>>>> Hope this helps,
>>>>>>> -Abe
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 19, 2013 at 8:33 AM, David Kincaid <kincaid.dave@gmail.com> wrote:
>>>>>>>
>>>>>>>> We're seeing a strange thing happen with a sqoop import job with
>>>>>>>> the way the key range is getting distributed among the 4 mappers that
>>>>>>>> are running. The minimum key value is 2110 and the maximum value is
>>>>>>>> 288272191. We are getting one mapper that is only getting one record
>>>>>>>> to import. Here is the distribution among the mappers:
>>>>>>>>
>>>>>>>> [2110, 96092137)
>>>>>>>> [96092137, 192182164)
>>>>>>>> [192182164, 288272191)
>>>>>>>> [288272191, 288272192)
>>>>>>>>
>>>>>>>> You can see that the fourth mapper is given a range with only one
>>>>>>>> value in it. Could someone help me understand what is going on?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Dave
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
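
For anyone following the arithmetic in this thread, here is a minimal Python
sketch of the two splitting behaviours discussed above. It is not Sqoop's
actual code: the function names split_ranges and even_ranges are hypothetical,
and the rounding-down of the split length is the assumption Abe states. The
first function follows Abe's description (points at min + length*n, then
max + 1 appended), and the second follows the user guide's example (remainder
folded into the last range).

```python
def split_ranges(min_val, max_val, num_splits):
    """Half-open [lo, hi) ranges built as described in the thread.

    Hypothetical helper, not Sqoop's implementation. Assumes integer keys
    and that the split length is rounded down.
    """
    # Split length; at least 1 so the loop below always advances.
    size = max(1, (max_val - min_val) // num_splits)

    # Generate points min + size*n for as long as they do not pass max.
    points = []
    cur = min_val
    while cur <= max_val:
        points.append(cur)
        cur += size

    # Append max + 1 so the final half-open range still covers max.
    points.append(max_val + 1)

    # Pair consecutive points into (lo, hi) ranges.
    return list(zip(points, points[1:]))


def even_ranges(min_val, max_val, num_splits):
    """Ranges as in the user guide example: remainder goes to the last range.

    Hypothetical helper, not Sqoop's implementation.
    """
    size = (max_val - min_val) // num_splits
    bounds = [min_val + size * i for i in range(num_splits)]
    bounds.append(max_val + 1)
    return list(zip(bounds, bounds[1:]))


if __name__ == "__main__":
    # The key range from the original question, with 3 splits requested.
    for lo, hi in split_ranges(2110, 288272191, 3):
        print(f"[{lo}, {hi})")

    # The user guide's example: keys 0..1000 with 4 tasks.
    for lo, hi in even_ranges(0, 1000, 4):
        print(f"[{lo}, {hi})")
```

With min 2110, max 288272191 and 3 splits, the first function returns the four
ranges Dave reported. Because 96090027*3 lands exactly on the max, the max
becomes a split point of its own and the last range holds a single key.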
