sqoop-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jarek Cecho" <jar...@apache.org>
Subject Re: Review Request 12936: SQOOP-777. Sqoop2: Pluggable Intermediate Data Format
Date Mon, 26 Aug 2013 23:55:10 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/12936/#review25304
-----------------------------------------------------------


Hi Hari,
thank you again for working on this JIRA and please accept my apologies for the late response.
I've deep dived into the patch and I do have couple of comments and questions.


common/src/main/java/org/apache/sqoop/etl/io/DataReader.java
<https://reviews.apache.org/r/12936/#comment49646>

    Extreme nit: Read data from the execution framework...
    
    (not necessary MR)



common/src/main/java/org/apache/sqoop/etl/io/DataReader.java
<https://reviews.apache.org/r/12936/#comment49647>

    Extreme nit: Read data from execution framework...



common/src/main/java/org/apache/sqoop/etl/io/DataReader.java
<https://reviews.apache.org/r/12936/#comment49648>

    Nit: Read data from execution framework as a native format.
    
    This should be independent native format right?



common/src/main/java/org/apache/sqoop/etl/io/DataWriter.java
<https://reviews.apache.org/r/12936/#comment49649>

    Extreme nit: Write an array of objects into the execution framework...



common/src/main/java/org/apache/sqoop/etl/io/DataWriter.java
<https://reviews.apache.org/r/12936/#comment49650>

    Extreme nit: Write data into execution framework....



common/src/main/java/org/apache/sqoop/schema/type/Column.java
<https://reviews.apache.org/r/12936/#comment49651>

    Can we add descriptive java doc describing what exactly are expecting to validate?



connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcExportInitializer.java
<https://reviews.apache.org/r/12936/#comment50066>

    Those imports seems to be unused.



connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcExportInitializer.java
<https://reviews.apache.org/r/12936/#comment50067>

    We should also take into account exportJobConfiguration.table.schema when building the
query.



connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcImportInitializer.java
<https://reviews.apache.org/r/12936/#comment50068>

    Let's clean up unused imports after this refactoring.



connector/connector-generic-jdbc/src/test/java/org/apache/sqoop/connector/jdbc/TestExportLoader.java
<https://reviews.apache.org/r/12936/#comment49652>

    This method seems to be generated automatically, but it seems to be removing the fail()
call, is it intentional?



connector/connector-sdk/src/main/java/org/apache/sqoop/connector/CSVIntermediateDataFormat.java
<https://reviews.apache.org/r/12936/#comment49656>

    The wiki [1] is also saying that the 0x5C should be substituted.
    
    1: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation



connector/connector-sdk/src/main/java/org/apache/sqoop/connector/CSVIntermediateDataFormat.java
<https://reviews.apache.org/r/12936/#comment49658>

    Similarly as above, it seems that we are missing the \\ replacement for 0x5C.



connector/connector-sdk/src/main/java/org/apache/sqoop/connector/CSVIntermediateDataFormat.java
<https://reviews.apache.org/r/12936/#comment50050>

    Is it wise to consume the exception here without any counters or other reporting except
of log message?
    
    It also seems that other parts of the code are not null safe, so this error handling will
most likely just cause NPE somewhere else.



connector/connector-sdk/src/main/java/org/apache/sqoop/connector/CSVIntermediateDataFormat.java
<https://reviews.apache.org/r/12936/#comment49660>

    The setData() method is conditionally running the validations, are they intentionally
skipped here in setTextData()?



connector/connector-sdk/src/main/java/org/apache/sqoop/connector/CSVIntermediateDataFormat.java
<https://reviews.apache.org/r/12936/#comment49661>

    I'm afraid that String.split() won't work correctly for following example input:
    
    1,"Hi, here is Jarcec"
    
    That contains two columns, but three columns will be created when splitting by comma.



connector/connector-sdk/src/main/java/org/apache/sqoop/connector/CSVIntermediateDataFormat.java
<https://reviews.apache.org/r/12936/#comment50051>

    Nit: Please pass the correct array size instead of the zero.



connector/connector-sdk/src/main/java/org/apache/sqoop/connector/CSVIntermediateDataFormat.java
<https://reviews.apache.org/r/12936/#comment50052>

    I would suggest to be strict here and accept only upper case variant of NULL.



connector/connector-sdk/src/main/java/org/apache/sqoop/connector/CSVIntermediateDataFormat.java
<https://reviews.apache.org/r/12936/#comment50053>

    I would suggest to die quickly for unsupported data types rather than just silently pass
them as a un-escaped strings.



connector/connector-sdk/src/main/java/org/apache/sqoop/connector/CSVIntermediateDataFormat.java
<https://reviews.apache.org/r/12936/#comment50055>

    Is there a reason why to do our own join rather than StringUtils.join()?



connector/connector-sdk/src/main/java/org/apache/sqoop/connector/CSVIntermediateDataFormat.java
<https://reviews.apache.org/r/12936/#comment50056>

    Is expected and desirable that user can't reset schema to NULL?



connector/connector-sdk/src/main/java/org/apache/sqoop/connector/CSVIntermediateDataFormat.java
<https://reviews.apache.org/r/12936/#comment50035>

    Nit: Would it make more sense to have IntermediateDataFormat (without type here) or CSVIntermediateDataFormat
since we are not allowing nothing else?



connector/connector-sdk/src/main/java/org/apache/sqoop/connector/CSVIntermediateDataFormat.java
<https://reviews.apache.org/r/12936/#comment50036>

    This will convert the binary string to an array of 0 and 1, right? The intermediate data
format is suggesting the MySQLdump approach that is serializing the bytes as they are though.
The 0 and 1 seems to be quite inefficient.



connector/connector-sdk/src/main/java/org/apache/sqoop/connector/CSVIntermediateDataFormat.java
<https://reviews.apache.org/r/12936/#comment50037>

    Since the regular expressions are pre-build and therefore the PatternSyntaxException should
not be thrown, is there any other exception that we are expecting?



connector/connector-sdk/src/main/java/org/apache/sqoop/connector/CSVIntermediateDataFormat.java
<https://reviews.apache.org/r/12936/#comment50060>

    The conditional escaped argument seems quite dangerous to me - what if the user has saved
the data on disk with escaped = true and is using this class to read the data from disk? The
default value is false, so this will remove some actual data, right?



connector/connector-sdk/src/main/java/org/apache/sqoop/connector/CSVIntermediateDataFormat.java
<https://reviews.apache.org/r/12936/#comment50038>

    Since the regular expressions are pre-build and therefore the PatternSyntaxException should
not be thrown, is there any other exception that we are expecting?



connector/connector-sdk/src/main/java/org/apache/sqoop/connector/IntermediateDataFormat.java
<https://reviews.apache.org/r/12936/#comment50058>

    Since couple of the methods, such as setSchema(), getSchema(), validateData() will have
to be implemented by every IDF in a very similar way, I'm wondering if it would make sense
to convert the IDF to abstract class and provide the base implementation?



connector/connector-sdk/src/main/java/org/apache/sqoop/connector/IntermediateDataFormat.java
<https://reviews.apache.org/r/12936/#comment50039>

    Can we javadoc for those two methods?
    
    Also it seems that they are not used, so I'm wondering if we are planning to use them
in the future or if they are artifact from previous development?



core/src/main/java/org/apache/sqoop/framework/SubmissionRequest.java
<https://reviews.apache.org/r/12936/#comment50040>

    Can we make this a Class instead of String?



execution/mapreduce/src/main/java/org/apache/sqoop/execution/mapreduce/MapreduceExecutionEngine.java
<https://reviews.apache.org/r/12936/#comment50041>

    Nit: Can we keep the style from previous lines and put this on single line?



execution/mapreduce/src/main/java/org/apache/sqoop/job/io/SqoopWritable.java
<https://reviews.apache.org/r/12936/#comment50063>

    The length variable seems to be used only readFields method, so I'm wondering if local
variable wouldn't be better fit?



execution/mapreduce/src/main/java/org/apache/sqoop/job/io/SqoopWritable.java
<https://reviews.apache.org/r/12936/#comment50064>

    Casting the data to UTF might be quite dangerous, especially for binary values as they
will be interpreted and possibly corrupted. This won't be an issue with current implementation
that stores binary data as a stream of 0s and 1s, but will become an issue when (if) we switch
to the format designed on the wiki.



execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopFileOutputFormat.java
<https://reviews.apache.org/r/12936/#comment50070>

    This import do not seem to be relevant any more.



execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopMapper.java
<https://reviews.apache.org/r/12936/#comment50042>

    Nit: Please keep the line on a single line.



execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopMapper.java
<https://reviews.apache.org/r/12936/#comment50043>

    Nit: This change do not seem necessary.



execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopNullOutputFormat.java
<https://reviews.apache.org/r/12936/#comment50071>

    This import do not seem to be relevant any more.



execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopOutputFormatLoadExecutor.java
<https://reviews.apache.org/r/12936/#comment50044>

    Nit: Can we please put this on single line?



execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopReducer.java
<https://reviews.apache.org/r/12936/#comment50072>

    This import do not seem to be relevant any more.



execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopReducer.java
<https://reviews.apache.org/r/12936/#comment50045>

    Nit: Please put the line on a single line.



execution/mapreduce/src/test/java/org/apache/sqoop/job/JobUtils.java
<https://reviews.apache.org/r/12936/#comment50069>

    This import do not seem relevant any more.



execution/mapreduce/src/test/java/org/apache/sqoop/job/JobUtils.java
<https://reviews.apache.org/r/12936/#comment50046>

    Nit: Please put this on a single line.



execution/mapreduce/src/test/java/org/apache/sqoop/job/TestHdfsExtract.java
<https://reviews.apache.org/r/12936/#comment50047>

    It seems that we are building the same schema in multiple test cases. Would it make sense
to create a helper method for this rather than copy&pasting the code?



execution/mapreduce/src/test/java/org/apache/sqoop/job/io/SqoopWritableTest.java
<https://reviews.apache.org/r/12936/#comment50065>

    Hari got 20 experience points for using awesome text messages in the tests!



spi/pom.xml
<https://reviews.apache.org/r/12936/#comment50048>

    I'm not sure whether we want to make all connectors depending on the SDK. The SDK should
be light library of code for connectors, We should not force the connector to reuse anything.
Is the dependency required?



spi/src/main/java/org/apache/sqoop/connector/spi/SqoopConnector.java
<https://reviews.apache.org/r/12936/#comment50049>

    Nit: Please put this on a single line.


Jarcec

- Jarek Cecho


On Aug. 1, 2013, 3:41 a.m., Hari Shreedharan wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/12936/
> -----------------------------------------------------------
> 
> (Updated Aug. 1, 2013, 3:41 a.m.)
> 
> 
> Review request for Sqoop.
> 
> 
> Bugs: SQOOP-777
>     https://issues.apache.org/jira/browse/SQOOP-777
> 
> 
> Repository: sqoop-sqoop2
> 
> 
> Description
> -------
> 
> Implemented a pluggable intermediate data format that decouples the internal representation
of the data from the connector and the output formats. Connectors can choose to implement
and support a format that is more efficient for them. Also separated the SqoopWritable so
that we can use the intermediate data format independent of (current) Hadoop. 
> 
> I ran a full build - all tests including integration tests pass. I have not added any
new tests, yet. I will add unit tests for the new classes. Also, I have not tried running
this on an actual cluster - so things may be broken. I'd like some initial feedback based
on the current patch. 
> 
> I also implemented escaping of characters. There is some work remaining to support binary
format, but it is mostly integration, the basic implementation is in place.
> 
> 
> Diffs
> -----
> 
>   common/pom.xml db11b5b 
>   common/src/main/java/org/apache/sqoop/etl/io/DataReader.java 3e1adc7 
>   common/src/main/java/org/apache/sqoop/etl/io/DataWriter.java d81364e 
>   common/src/main/java/org/apache/sqoop/schema/type/Column.java 8b630b2 
>   connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcConnector.java
e0da80f 
>   connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcExportInitializer.java
7212843 
>   connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcImportInitializer.java
96818ba 
>   connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/util/InitializationUtils.java
PRE-CREATION 
>   connector/connector-generic-jdbc/src/test/java/org/apache/sqoop/connector/jdbc/TestExportLoader.java
aa1c4ff 
>   connector/connector-generic-jdbc/src/test/java/org/apache/sqoop/connector/jdbc/TestImportExtractor.java
a7ed6ba 
>   connector/connector-sdk/pom.xml 4056e14 
>   connector/connector-sdk/src/main/java/org/apache/sqoop/connector/CSVIntermediateDataFormat.java
PRE-CREATION 
>   connector/connector-sdk/src/main/java/org/apache/sqoop/connector/IntermediateDataFormat.java
PRE-CREATION 
>   connector/connector-sdk/src/test/java/org/apache/sqoop/connector/CSVIntermediateDataFormatTest.java
PRE-CREATION 
>   core/src/main/java/org/apache/sqoop/framework/JobManager.java d0a087d 
>   core/src/main/java/org/apache/sqoop/framework/SubmissionRequest.java 53d0039 
>   execution/mapreduce/pom.xml f9a2a0e 
>   execution/mapreduce/src/main/java/org/apache/sqoop/execution/mapreduce/MapreduceExecutionEngine.java
767080c 
>   execution/mapreduce/src/main/java/org/apache/sqoop/job/JobConstants.java 7fd9a01 
>   execution/mapreduce/src/main/java/org/apache/sqoop/job/etl/HdfsExportExtractor.java
1978ec6 
>   execution/mapreduce/src/main/java/org/apache/sqoop/job/etl/HdfsSequenceImportLoader.java
a07c511 
>   execution/mapreduce/src/main/java/org/apache/sqoop/job/etl/HdfsTextImportLoader.java
4621942 
>   execution/mapreduce/src/main/java/org/apache/sqoop/job/io/SqoopWritable.java PRE-CREATION

>   execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopFileOutputFormat.java
356ae8a 
>   execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopMapper.java 59cf391

>   execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopNullOutputFormat.java
90de6ef 
>   execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopOutputFormatLoadExecutor.java
739eb17 
>   execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopReducer.java b31161c

>   execution/mapreduce/src/test/java/org/apache/sqoop/job/JobUtils.java e21f15b 
>   execution/mapreduce/src/test/java/org/apache/sqoop/job/TestHdfsExtract.java b7079dd

>   execution/mapreduce/src/test/java/org/apache/sqoop/job/TestHdfsLoad.java f849aae 
>   execution/mapreduce/src/test/java/org/apache/sqoop/job/TestMapReduce.java 7b264c6 
>   execution/mapreduce/src/test/java/org/apache/sqoop/job/io/SqoopWritableTest.java PRE-CREATION

>   execution/mapreduce/src/test/java/org/apache/sqoop/job/mr/TestSqoopOutputFormatLoadExecutor.java
bee8ab7 
>   pom.xml 5ea0633 
>   spi/pom.xml 0b240e8 
>   spi/src/main/java/org/apache/sqoop/connector/spi/SqoopConnector.java 2becc56 
>   submission/mapreduce/src/main/java/org/apache/sqoop/submission/mapreduce/MapreduceSubmissionEngine.java
6fc485b 
> 
> Diff: https://reviews.apache.org/r/12936/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Hari Shreedharan
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message