sqoop-user mailing list archives

From David Kincaid <kincaid.d...@gmail.com>
Subject Re: Strange distribution of keys among mappers
Date Thu, 20 Jun 2013 00:33:37 GMT
Right. That seems to be what's happening. Thank you for all the help
understanding. It's making sense now.

- Dave


On Wed, Jun 19, 2013 at 7:30 PM, Abraham Elmahrek <abe@cloudera.com> wrote:

> David,
>
> It's really just a hint. So the splitters will try to hit whatever is
> defined, but an extra may be created. For instance, BigDecimalSplitter will
> create 4 splits for certain ranges with 3 MR tasks specified.
>
> -Abe
>
>
> On Wed, Jun 19, 2013 at 5:03 PM, David Kincaid <kincaid.dave@gmail.com> wrote:
>
>> We don't have that set on our cluster and aren't specifying it in our
>> job. When I look at the different sqoop jobs I see both 3 for some and 4
>> for others on the jobs.
>>
>>
>> On Wed, Jun 19, 2013 at 6:50 PM, Abraham Elmahrek <abe@cloudera.com> wrote:
>>
>>> David,
>>>
>>> Well I think sqoop is looking at "mapred.map.tasks". Do you have that
>>> set in mapred-site.xml? I thought that defaults to 2.
>>>
>>> -Abe
>>>
>>>
>>> On Wed, Jun 19, 2013 at 4:31 PM, Abraham Elmahrek <abe@cloudera.com> wrote:
>>>
>>>> David,
>>>>
>>>> I've created https://issues.apache.org/jira/browse/SQOOP-1093 to track
>>>> the documentation issue. Thanks for bringing this to the community's
>>>> attention!
>>>>
>>>> -Abe
>>>>
>>>>
>>>> On Wed, Jun 19, 2013 at 4:21 PM, Abraham Elmahrek <abe@cloudera.com> wrote:
>>>>
>>>>> Hey David,
>>>>>
>>>>> With oracle, the BigDecimalSplitter will be used by default for all
>>>>> number types.
>>>>>
>>>>> -Abe
>>>>>
>>>>>
>>>>> On Wed, Jun 19, 2013 at 4:05 PM, David Kincaid <kincaid.dave@gmail.com> wrote:
>>>>>
>>>>>> Abe, the database is Oracle.
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 19, 2013 at 5:48 PM, Abraham Elmahrek <abe@cloudera.com> wrote:
>>>>>>
>>>>>>> David,
>>>>>>>
>>>>>>> What database are you importing from? The description I gave was for
>>>>>>> datatypes that map to the BigDecimalSplitter. The user guide might be
>>>>>>> referring to the IntegerSplitter, which will add the remainder to the
>>>>>>> last value.
>>>>>>>
>>>>>>> -Abe
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 19, 2013 at 1:23 PM, David Kincaid <
>>>>>>> kincaid.dave@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks. We didn't specify the number of mappers, so it's giving us
>>>>>>>> 4. I understand your explanation, but it seems to conflict with the
>>>>>>>> Sqoop user guide (
>>>>>>>> http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_parallelism
>>>>>>>> ):
>>>>>>>>
>>>>>>>> "When performing parallel imports, Sqoop needs a criterion by which
>>>>>>>> it can split the workload. Sqoop uses a *splitting column* to split
>>>>>>>> the workload. By default, Sqoop will identify the primary key column
>>>>>>>> (if present) in a table and use it as the splitting column. The low
>>>>>>>> and high values for the splitting column are retrieved from the
>>>>>>>> database, and the map tasks operate on evenly-sized components of
>>>>>>>> the total range. For example, if you had a table with a primary key
>>>>>>>> column of id whose minimum value was 0 and maximum value was 1000,
>>>>>>>> and Sqoop was directed to use 4 tasks, Sqoop would run four
>>>>>>>> processes which each execute SQL statements of the form SELECT *
>>>>>>>> FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to
>>>>>>>> (0, 250), (250, 500), (500, 750), and (750, 1001) in the different
>>>>>>>> tasks."
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jun 19, 2013 at 3:14 PM, Abraham Elmahrek <abe@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>> Hey David,
>>>>>>>>>
>>>>>>>>> Here's the algorithm:
>>>>>>>>> Split lengths are defined by (max - min)/(# mappers) and whatever
>>>>>>>>> is left is tacked on at the end. So in this case,
>>>>>>>>> (288272191-2110)/3 = 96090027.33... So I'm assuming the .33... is
>>>>>>>>> rounded down and split lengths will be of length 96090027. Sqoop
>>>>>>>>> will then create splits with the following points:
>>>>>>>>> (min) + (range length)*(n). We can see that
>>>>>>>>> 2110 + 96090027*0 = 2110, 2110 + 96090027*1 = 96092137,
>>>>>>>>> 2110 + 96090027*2 = 192182164, and 2110 + 96090027*3 = 288272191
>>>>>>>>> will be generated based off of this algorithm. The last point to be
>>>>>>>>> added will be 288272192 because the max value is not part of the
>>>>>>>>> generated split points. Then sqoop will distribute accordingly
>>>>>>>>> based off of these points as you've pointed out above.
>>>>>>>>>
>>>>>>>>> Just to be sure, did you configure sqoop to use 3 mappers?
>>>>>>>>>
>>>>>>>>> Hope this helps,
>>>>>>>>> -Abe
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jun 19, 2013 at 8:33 AM, David Kincaid <
>>>>>>>>> kincaid.dave@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> We're seeing a strange thing happen with a sqoop import job with
>>>>>>>>>> the way the key range is getting distributed among the 4 mappers
>>>>>>>>>> that are running. The minimum key value is 2110 and the maximum
>>>>>>>>>> value is 288272191. We are getting one mapper that is only getting
>>>>>>>>>> one record to import. Here is the distribution among the mappers:
>>>>>>>>>>
>>>>>>>>>> [2110, 96092137)
>>>>>>>>>> [96092137, 192182164)
>>>>>>>>>> [192182164, 288272191)
>>>>>>>>>> [288272191, 288272192)
>>>>>>>>>>
>>>>>>>>>> you can see that the fourth mapper is given a range with only one
>>>>>>>>>> value in it. Could someone help me understand what is going on?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Dave
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
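
The split-point arithmetic discussed in the quoted thread can be reproduced
with a short sketch. This is not Sqoop's source code: both function names are
hypothetical, the first function only follows the description Abe gives for
the BigDecimalSplitter case (fixed step, then one more point past the max),
and the second follows the user guide's example where the remainder goes to
the last range. Treating max + 1 as an exclusive upper bound is an assumption
made so the output matches the half-open ranges Dave posted.

```python
def fixed_step_ranges(lo, hi, num_splits):
    """Half-open ranges from a fixed step, per the description in the thread.

    Hypothetical helper, not Sqoop's API. Assumes integer keys and treats
    hi + 1 as the exclusive upper bound so that hi itself is covered.
    """
    step = max(1, (hi - lo) // num_splits)  # fractional part is dropped
    end = hi + 1
    points = []
    p = lo
    while p < end:          # generate lo, lo + step, lo + 2*step, ...
        points.append(p)
        p += step
    points.append(end)      # final point closes the last range
    return list(zip(points[:-1], points[1:]))


def remainder_on_last_ranges(lo, hi, num_splits):
    """Exactly num_splits ranges; the last one absorbs the remainder.

    Hypothetical helper modelled on the user guide's 0..1000 example.
    """
    step = max(1, (hi - lo) // num_splits)
    bounds = [lo + step * i for i in range(num_splits)] + [hi + 1]
    return list(zip(bounds[:-1], bounds[1:]))


if __name__ == "__main__":
    # Dave's key range with 3 requested splits: a fourth, one-value range appears.
    for r in fixed_step_ranges(2110, 288272191, 3):
        print(r)
    # The user guide's example: four ranges, the last ending at 1001.
    for r in remainder_on_last_ranges(0, 1000, 4):
        print(r)
```

With Dave's values the first function yields four ranges even though three
were requested, because the third step lands exactly on the maximum key and
one more point is needed to cover it. The second function always returns the
requested number of ranges, which is the behaviour the user guide describes.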
