sqoop-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Venkat Ranganathan" <n....@live.com>
Subject Re: Review Request: Fix for SQOOP-1013
Date Sat, 01 Jun 2013 00:07:01 GMT


> On May 31, 2013, 8:30 p.m., Abraham Elmahrek wrote:
> > Seems good except for a few comments and nit picks.
> > 
> > The datetime split logic is when certain intervals are "hotter" than others. IE:
1000000 rows out of 20000000 exist between the date range of december 1st to december 31st,
but a user is importing the entire years data, with 12 node. Basically, 1 machine will extract
1000000 rows, while the others will extract 1000000/11 rows. In the future we could probably
add some upfront analysis of the data to improve distribution.
> 
> Abraham Elmahrek wrote:
>     Sorry... my wording above wasn't very concise... The datetime split logic doesn't
seem to handle "hot" partitions is what I meant.

Good point Abraham - even though this skew can happen to all datatypes that are used for splitting,
with dates, there is more chance of such skew.  Typically the solution for this as also to
efficiently manage the data life-cycle is dependent on the database facilities, so we may
have to do database specific functionality like partitioning etc.   A generic solution can
be attempted, but support for grouping and ordering based on expressions varies between different
databases.   But something to keep in our mind


> On May 31, 2013, 8:30 p.m., Abraham Elmahrek wrote:
> > connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcImportPartitioner.java,
lines 164-180
> > <https://reviews.apache.org/r/11537/diff/2/?file=299088#file299088line164>
> >
> >     Nit: This seems like it can be merged above.

I am assuming you are talking about the end condition.   Please see above


> On May 31, 2013, 8:30 p.m., Abraham Elmahrek wrote:
> > connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcImportPartitioner.java,
lines 139-163
> > <https://reviews.apache.org/r/11537/diff/2/?file=299088#file299088line139>
> >
> >     Nit: This loop can probably handle all partitions by using constructDateConditions(objLB,
objUB, false) to constructDateConditions(objLB, objUB, upperBound == maxDateValue). Also,
upperBound += (i < remainder) ? 1 : 0;

That is true, but we have the pattern to add the last partition separately for all partitioning
types.   It is better to be consistent, I thought.   Do you agree?


> On May 31, 2013, 8:30 p.m., Abraham Elmahrek wrote:
> > connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcImportPartitioner.java,
line 190
> > <https://reviews.apache.org/r/11537/diff/2/?file=299088#file299088line190>
> >
> >     There doesn't seem to be any thing stopping partitionMaxValue from being null.
So, it makes sense to add some null checking here as well.

min and max being null is handled for all datatypes.   NULL in the presence of all values
will be treated as having a lower value and so it is impossible to get MAX to null while MIN
is not null.

In fact, when one row has a non-null value for a column with all the other rows in the table
having the split by column as NULL will return the single non-null value for both min and
max.

So, the whole null handling thing has to be cleaned up.   I will file a follow on JIRA for
that


> On May 31, 2013, 8:30 p.m., Abraham Elmahrek wrote:
> > connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcImportPartitioner.java,
line 232
> > <https://reviews.apache.org/r/11537/diff/2/?file=299088#file299088line232>
> >
> >     There doesn't seem to be any thing stopping partitionMaxValue from being null.
So, it makes sense to add some null checking here as well.

Please see above


- Venkat


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11537/#review21237
-----------------------------------------------------------


On May 30, 2013, 6:25 p.m., Venkat Ranganathan wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/11537/
> -----------------------------------------------------------
> 
> (Updated May 30, 2013, 6:25 p.m.)
> 
> 
> Review request for Sqoop.
> 
> 
> Description
> -------
> 
> This addresses Boolean, date, time, and timestamp splitters.
> 
> THis also disallows char type splitters as discussed in SQOOP-976
> 
> 
> Diffs
> -----
> 
>   connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcImportPartitioner.java
f80f30d 
>   connector/connector-generic-jdbc/src/test/java/org/apache/sqoop/connector/jdbc/TestImportPartitioner.java
ee314d0 
> 
> Diff: https://reviews.apache.org/r/11537/diff/
> 
> 
> Testing
> -------
> 
> Introduced new unit tests to test new functionality
> All tests pass
> 
> 
> Thanks,
> 
> Venkat Ranganathan
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message