sqoop-dev mailing list archives

From "Cheolsoo Park (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SQOOP-603) Support small intervals in IntegerSplitter implementation
Date Thu, 20 Sep 2012 19:21:08 GMT

    [ https://issues.apache.org/jira/browse/SQOOP-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459864#comment-13459864 ]

Cheolsoo Park commented on SQOOP-603:
-------------------------------------

Committed to trunk. Thanks Jarcec!

Note that while running unit tests, I discovered that the 3rd party LobAvro tests fail with
-Dhadoopversion=20. However, that's unrelated to this change, and I am going to open a jira
for them.

@Jarcec,
Can you please generate the patch with git diff --no-prefix next time? In fact, you might
want to define an alias for it in .bashrc, such as gitdiff="git diff --no-prefix".

Thanks!
                
> Support small intervals in IntegerSplitter implementation
> ---------------------------------------------------------
>
>                 Key: SQOOP-603
>                 URL: https://issues.apache.org/jira/browse/SQOOP-603
>             Project: Sqoop
>          Issue Type: Improvement
>    Affects Versions: 1.4.2
>            Reporter: Jarek Jarcec Cecho
>            Assignee: Jarek Jarcec Cecho
>             Fix For: 1.4.3
>
>         Attachments: SQOOP-603.patch
>
>
> IntegerSplitter currently creates splits of the following form:
> {code}
> minimal value <= x < splitPoint1
> splitPoint1 <= x < splitPoint2
> ...
> splitPointN <= x <= maximal value
> {code}
> Note that the upper bound always uses the condition "<", except for the last split,
> which uses "<=". This is perfectly fine when creating a reasonable number of splits
> over a very large interval.
> This approach, however, causes issues on very small intervals. For example, the
> following splits are created on the interval [0, 5] with 5 splits:
> * 0 <= x < 1
> * 1 <= x < 2 
> * 2 <= x < 3 
> * 3 <= x < 4 
> * 4 <= x <= 5
> Notice that every split contains exactly one number except the last one, which contains
> two: 4 and 5. This becomes a serious issue when, for example, the user needs one split
> per partition: one mapper ends up moving two partitions, so it takes twice as long as
> the others and holds up the entire job.
> Jarcec
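
To make the imbalance described above concrete, here is a minimal Java sketch. It is an illustration only, not Sqoop's actual IntegerSplitter code: the class SplitSketch and the methods boundaries and valuesPerSplit are hypothetical names, and the even-spacing rule is an assumption chosen to reproduce the [0, 5] example from the issue description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: SplitSketch and its methods are hypothetical names,
// not part of Sqoop. The spacing rule below is an assumption.
final class SplitSketch {

    // Returns split boundaries from min to max, spaced by integer division.
    // Boundary i and boundary i + 1 delimit split i.
    static List<Long> boundaries(long min, long max, int numSplits) {
        List<Long> points = new ArrayList<>();
        long step = Math.max(1, (max - min) / numSplits);
        for (long p = min; p < max && points.size() < numSplits; p += step) {
            points.add(p);
        }
        points.add(max); // the final boundary is always the maximal value
        return points;
    }

    // Counts the integers in each split: every split is [lo, hi) except
    // the last one, which is [lo, hi] and therefore holds one extra value.
    static long[] valuesPerSplit(List<Long> b) {
        int n = b.size() - 1;
        long[] counts = new long[n];
        for (int i = 0; i < n; i++) {
            counts[i] = b.get(i + 1) - b.get(i);
            if (i == n - 1) {
                counts[i] += 1; // last split uses "<=" on its upper bound
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Interval [0, 5] with 5 splits, as in the issue description.
        long[] counts = valuesPerSplit(boundaries(0, 5, 5));
        System.out.println(Arrays.toString(counts)); // prints [1, 1, 1, 1, 2]
    }
}
```

On a large interval the one extra value in the last split is negligible, but here it doubles that split's share of the work, which is the problem the issue reports.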

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
