sqoop-dev mailing list archives

From "Veena Basavaraj (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SQOOP-1771) Investigation FORMAT of the Array/NestedArray/ Set/ Map in Postgres and HIVE.
Date Fri, 21 Nov 2014 21:46:33 GMT

     [ https://issues.apache.org/jira/browse/SQOOP-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Veena Basavaraj updated SQOOP-1771:
-----------------------------------
    Description: 
Update this wiki, which is missing details on the complex types:

https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal

The above document does not explicitly state the design goals behind choosing the IDF format
for the different types, but a conversation on one of the related tickets (RB: https://reviews.apache.org/r/28139/diff/#)
surfaced them. Here are the considerations.

The Intermediate Data Format (IDF) is most relevant when we transfer data between the FROM
and TO sides and the two do not agree on the same form for the data as it is transferred.

The IDF API as of today exposes three kinds of setters: one for a generic type T, one for
Text/String, and one for an object array.
{code}
  /**
   * Set one row of data. If validate is set to true, the data is validated
   * against the schema.
   *
   * @param data - A single row of data to be moved.
   */
  public void setData(T data) {
    this.data = data;
  }

  /**
   * Get one row of data.
   *
   * @return - One row of data, represented in the internal/native format of
   *         the intermediate data format implementation.
   */
  public T getData() {
    return data;
  }

  /**
   * Get one row of data as CSV.
   *
   * @return - String representing the data in CSV, according to the "FROM" schema.
   * No schema conversion is done on textData, to keep it as "high performance" option.
   */
  public abstract String getTextData();

  /**
   * Set one row of data as CSV.
   *
   * @param text - String representing the data in CSV.
   */
  public abstract void setTextData(String text); 
{code} 

NOTE: the javadocs are not completely accurate; there is really no validation happening :).
Second, CSV is only one way the IDF can be represented when it is text. There can be other
IDF implementations as well, such as Avro or JSON, very similar to the SerDe interface in Hive.
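
To make the SerDe analogy concrete, here is a minimal, hypothetical sketch of the IDF contract next to a CSV implementation. The class shapes are illustrative, not the actual Sqoop 2 classes; the point is that the CSV implementation's native type T is the CSV text itself, so its text accessors can be zero-cost pass-throughs:

```java
// Hypothetical, simplified sketch of the IDF contract described above.
abstract class IntermediateDataFormat<T> {
  protected T data;

  public void setData(T data) { this.data = data; }

  public T getData() { return data; }

  public abstract String getTextData();

  public abstract void setTextData(String text);
}

class CsvIntermediateDataFormat extends IntermediateDataFormat<String> {
  @Override
  public String getTextData() {
    return data; // no conversion, no validation: the "high performance" path
  }

  @Override
  public void setTextData(String text) {
    this.data = text;
  }
}
```

An Avro- or JSON-flavored subclass would instead pay a conversion cost inside getTextData/setTextData, which is exactly the trade-off the proposal is weighing.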

Anyway, the design considerations seem to be the following:

1. setTextData/getTextData are supposed to allow the FROM and TO sides to talk the same language,
and hence there should be very minimal transformation as the data flows through Sqoop. This
means that both FROM and TO agree to produce data in the CSV IDF that is standardized in the
wiki/spec/docs and to read the data back in the same format. Transformation may have to happen
before setTextData() or after getTextData(), but nothing happens in between while the data
flows through Sqoop.
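
As a sketch of consideration 1 (helper names here are hypothetical, not Sqoop 2 API): the FROM side does whatever transformation it needs before handing Sqoop the CSV string, the TO side transforms after reading it back, and the string itself is carried through untouched:

```java
// Illustrative only: where the transformation lives on each side.
class CsvHandoff {
  // FROM connector: convert native values to the standardized CSV row
  // *before* calling setTextData.
  static String encodeFromSide(long id, String name) {
    return id + ",'" + name + "'";
  }

  // TO connector: parse the CSV row *after* calling getTextData.
  // Naive split for illustration; real parsing must honor quoting rules.
  static String[] decodeToSide(String csvRow) {
    return csvRow.split(",", 2);
  }
}
```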

2. The current proposal seems to recommend the formats that are most prominent in the databases
that have been explored on the list, but that is not really a complete set of all the data
sources/connectors Sqoop may have in the future.
https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal

But overall the goal seems to lean toward dump tools such as mysqldump and pg_dump that use
a CSV format, and the hope is that such transfers in Sqoop will be more performant.

3. To avoid spending any CPU cycles, no validation is done to make sure the data adheres
to the CSV format. It is a trust-based system: the incoming data is assumed to follow the
CSV rules as depicted in the link above
https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal
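
Since nothing validates the text inside the IDF, a malformed row only surfaces when the TO side tries to parse it. A connector that wants a cheap safety net has to do its own spot-check before handing the row over; here is a rough, hypothetical heuristic (the wiki's CSV rules remain the real contract):

```java
class CsvSpotCheck {
  // Count top-level commas that are not inside single quotes and verify
  // that all quotes are closed. A rough heuristic for illustration, not
  // an implementation of the full CSV IDF rules.
  static boolean looksLikeValidRow(String row, int expectedColumns) {
    int columns = 1;
    boolean inQuotes = false;
    for (char c : row.toCharArray()) {
      if (c == '\'') {
        inQuotes = !inQuotes;
      } else if (c == ',' && !inQuotes) {
        columns++;
      }
    }
    return !inQuotes && columns == expectedColumns;
  }
}
```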





  was:
update this wiki, which is missing details on the complex types

https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal


> Investigation FORMAT of the Array/NestedArray/ Set/ Map in Postgres and HIVE.
> -----------------------------------------------------------------------------
>
>                 Key: SQOOP-1771
>                 URL: https://issues.apache.org/jira/browse/SQOOP-1771
>             Project: Sqoop
>          Issue Type: Sub-task
>          Components: sqoop2-framework
>            Reporter: Veena Basavaraj
>             Fix For: 1.99.5
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
