sqoop-dev mailing list archives

From "Qian Xu" <qian.a...@intel.com>
Subject Re: Review Request 29346: Supporting DRY code in new IDF implementation JSONIDF
Date Thu, 25 Dec 2014 02:54:06 GMT


> On Dec. 24, 2014, 12:45 p.m., Qian Xu wrote:
> > connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/JSONIntermediateDataFormat.java,
line 106
> > <https://reviews.apache.org/r/29346/diff/2/?file=799689#file799689line106>
> >
> >     I have a high-level question: is it necessary to keep a csvText internally in
IntermediateDataFormat? I get the feeling that any new IntermediateDataFormat (e.g. JSON) would
then be forced to deal with CSV.
> 
> Veena Basavaraj wrote:
>     Please read https://cwiki.apache.org/confluence/display/SQOOP/Intermediate+Data+Format+API
and the associated tickets linked from that wiki; they give more context on how a new IDF other
than CSV is envisioned to be done.
>     
>     My knowledge is what I have learnt in those tickets and discussions
> 
> Veena Basavaraj wrote:
>     Hey Qian,
>     
>     I was thinking about this more yesterday, and here is a high-level question I have.
Let me describe some details (notes, rather) from my head. Take your time to read it through
and let me know if you have any questions or need clarification.
>     
>     In the case of CSVIDF, data IS the CSV, hence we construct it proactively when the object
array is set: setObjectData(..) eagerly builds the csvText (represented by data), while everything
else is LAZY, i.e. the object array is constructed lazily from the csvText (i.e. data). We could
drop one of data/csvText and hold the value in a single place; I now actually think this should
be cleaned up in this RB itself, since there is no point in holding the same thing in two variables.
>     
>     In the case of JSONIDF, data is a JSONObject (likewise for any other IDF, say AvroIDF),
so what should be the source of truth? Should it be the JSON object alone, i.e. we hold only the
JSONObject and lazily construct both the CSV and the object array? Or should the CSV always be
constructed?
>     
>     So here are the two options
>     
>     Option #1 - Do not store anything but data
>     
>     1. Store the source of truth in data.
>     2. When setData is called, store the JSON in data and nothing else; everything is lazy.
>     3. When setObjectData is called, construct the JSON object from it; do not store any
csvText or object array.
>     4. When setCSVText is called, also construct the JSON object from it; do not store any
csvText or object array.
>     5. When getObjectData is called, convert from JSON to an object array, on demand.
>     6. When getCSVText is called, convert from JSON to csvText, on demand.
>     7. When getData is called, return the JSON.
>     Cons:
>     We store very little, so memory use stays low, but consider how the JSONIDF will be
used: if the FROM connector uses JSONIDF and stores data as "JSON", while the TO connector
might be HDFS and want CSV, then we do the CSV conversion on demand for every single record.
Still, that makes sense.
>     
>     Option #2 - Store both the data and the CSV. The rationale is that HDFS is one of the
major use cases and it might use CSV, but that is no longer strictly true: looking at the latest
code in the HDFS connector, it does readText in some cases and readArray in others.
>     1. The source of truth is the JSON and the CSV.
>     2. When setData is called, store the JSON in data, and also convert the JSON to csvText
and store the csvText.
>     3. When setObjectData is called, construct the CSV and store it in the csvText.
>     4. When setCSVText is called, construct the JSON, and store the csvText.
>     5. When getObjectData is called, use the csvText and return the object array, so we can
share the code in CSVIDF; hence moving this logic to the base class is justified.
>     6. When getCSVText is called, just return the stored csvText, so this is a no-op.
>     7. When getData is called, return the JSON.
>     
>     
>     Lastly, why do we need to mandate CSV and object array support for all IDFs? It goes
back to how we have defined the DataReader and DataWriter APIs: any one of these methods can
be invoked by connector code, which at times feels overly flexible to me, but I am not sure
I want to go there and debate that at this point :)
>     
>      /**
>        * Read data from the execution engine as an object array.
>        * @return - array of objects with each column represented as an object
>        * @throws Exception
>        */
>       public abstract Object[] readArrayRecord() throws Exception;
>     
>       /**
>        * Read data from execution engine as text - as a CSV record.
>        * @return - CSV formatted data.
>        * @throws Exception
>        */
>       public abstract String readTextRecord() throws Exception;
>     
>       /**
>        * Read data from execution engine as a native format.
>        * @return - the content in the native format of the intermediate data
>        * format being used.
>        * @throws Exception
>        */
>       public abstract Object readContent() throws Exception;
> 
> Veena Basavaraj wrote:
>     Obviously I chose to go with option #1, since option #2 has no real evidence that storing
CSV, or being proactive about parsing into and out of CSV, is a common use case.
>     Thanks for your awesome reviews as always, Qian.
>     
>     I am updating the patch soon.
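
[Editorial note] Option #1 above could be sketched roughly as follows. This is a minimal, hypothetical outline, not code from the patch: the class name and helpers are mine, and a plain Map stands in for a real JSONObject so the sketch stays self-contained. Only data is stored; the CSV text and object array are derived on every call.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of Option #1: the native record (a Map standing in for
// a JSONObject) is the single source of truth; csvText and the object array
// are derived lazily, on demand, and never cached.
public class LazyJsonIdfSketch {
    private final List<String> columns;   // schema column names, in order
    private Map<String, Object> data;     // the only stored state ("JSON")

    public LazyJsonIdfSketch(List<String> columns) {
        this.columns = columns;
    }

    // setData: store the JSON in data and nothing else (step 2).
    public void setData(Map<String, Object> json) {
        this.data = json;
    }

    // setObjectData: construct the JSON from the array; store nothing else (step 3).
    public void setObjectData(Object[] row) {
        Map<String, Object> json = new LinkedHashMap<>();
        for (int i = 0; i < columns.size(); i++) {
            json.put(columns.get(i), row[i]);
        }
        this.data = json;
    }

    // setCSVText: construct the JSON from the CSV; store nothing else (step 4).
    // Naive split for illustration; real CSV handling (quoting, escaping) is more involved.
    public void setCSVText(String csv) {
        setObjectData(csv.split(",", -1));
    }

    // getObjectData: convert JSON -> object array on demand (step 5).
    public Object[] getObjectData() {
        Object[] row = new Object[columns.size()];
        for (int i = 0; i < columns.size(); i++) {
            row[i] = data.get(columns.get(i));
        }
        return row;
    }

    // getCSVText: convert JSON -> CSV text on demand, on every call (step 6).
    public String getCSVText() {
        List<String> cells = new ArrayList<>();
        for (String col : columns) {
            cells.add(String.valueOf(data.get(col)));
        }
        return String.join(",", cells);
    }

    // getData: return the JSON as-is (step 7).
    public Map<String, Object> getData() {
        return data;
    }
}
```

The cost noted under "Cons" shows up in getCSVText(): a TO connector that wants CSV pays for the conversion on every record.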

I vote for option 1. For the classic use case "jdbc to hdfs", we will use CsvIDF, which will
not break anything: JDBC will feed CSV records, CsvIDF will validate the records and store the
CSV lazily, and HDFS will read the CSV records back without any conversion (again, lazily).
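
[Editorial note] That pass-through can be sketched in a couple of lines. This is hypothetical (the class name is mine, not from the patch): for a CSV-native IDF the stored data IS the CSV text, so the TO side gets the same string back with no conversion.

```java
// Hypothetical sketch of the "jdbc to hdfs" fast path: for a CSV-native IDF
// the CSV text is itself the native data, so setCSVText stores the string
// untouched and getCSVText hands the very same string back, with no parsing.
public class CsvPassThroughSketch {
    private String data; // the CSV text IS the native representation

    public void setCSVText(String csv) { // validation elided in this sketch
        this.data = csv;
    }

    public String getCSVText() { // no conversion: return the stored text as-is
        return data;
    }
}
```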


- Qian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29346/#review66008
-----------------------------------------------------------


On Dec. 24, 2014, 4:32 a.m., Veena Basavaraj wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29346/
> -----------------------------------------------------------
> 
> (Updated Dec. 24, 2014, 4:32 a.m.)
> 
> 
> Review request for Sqoop.
> 
> 
> Bugs: SQOOP-1901
>     https://issues.apache.org/jira/browse/SQOOP-1901
> 
> 
> Repository: sqoop-sqoop2
> 
> 
> Description
> -------
> 
> see JIRA
> 
> 
> Diffs
> -----
> 
>   connector/connector-sdk/pom.xml 38c217a 
>   connector/connector-sdk/src/main/java/org/apache/sqoop/connector/common/SqoopIDFUtils.java
48adae1 
>   connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/CSVIntermediateDataFormat.java
846aefd 
>   connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/CSVIntermediateDataFormatError.java
884550d 
>   connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/IntermediateDataFormat.java
7ac44dc 
>   connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/IntermediateDataFormatError.java
PRE-CREATION 
>   connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/JSONIntermediateDataFormat.java
PRE-CREATION 
>   connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/JSONIntermediateDataFormatError.java
PRE-CREATION 
>   connector/connector-sdk/src/test/java/org/apache/sqoop/connector/common/TestSqoopIDFUtils.java
e9108b0 
>   connector/connector-sdk/src/test/java/org/apache/sqoop/connector/idf/TestCSVIntermediateDataFormat.java
f6852a0 
>   connector/connector-sdk/src/test/java/org/apache/sqoop/connector/idf/TestJSONIntermediateDataFormat.java
PRE-CREATION 
>   pom.xml afdeb75 
> 
> Diff: https://reviews.apache.org/r/29346/diff/
> 
> 
> Testing
> -------
> 
> yes
> 
> 
> Thanks,
> 
> Veena Basavaraj
> 
>

