sqoop-dev mailing list archives

From "Mariappan Asokan" <maso...@syncsort.com>
Subject Re: Review Request 22516: Support importing mainframe sequential datasets
Date Fri, 08 Aug 2014 17:46:41 GMT

> On July 10, 2014, 8:22 a.m., Venkat Ranganathan wrote:
> > src/java/org/apache/sqoop/manager/MainframeManager.java, line 75
> > <https://reviews.apache.org/r/22516/diff/1/?file=608148#file608148line75>
> >
> >     Is import into HBase and Accumulo supported by this tool?  From the command help,
it looks like the only supported target is HDFS text files.
> Mariappan Asokan wrote:
>     Each record in a mainframe dataset is treated as a single field (or column).  So,
HBase, Accumulo, and Hive are theoretically supported, but with limited usability; that is
why I did not add them to the documentation.  If you feel strongly that they should be
documented, I can work on that in the next version of the patch.
> Venkat Ranganathan wrote:
>     I feel it would be good to say that we import only as text files and leave further
processing (loading into Hive/HBase) up to the user, since the composition of the records
and the processing needed differ, and the schema can't be inferred.
> Mariappan Asokan wrote:
>     I agree with you.  To avoid confusion, I plan to remove support for parsing the input
format, output format, Hive, HBase, HCatalog, and codegen options.  This will synchronize
the documentation with the code.  What do you think?
> Venkat Ranganathan wrote:
>     Sorry for the delay.  I was wondering whether the mainframe connector could just define
connector-specific extra args instead of creating another tool.  Please see NetezzaManager or
DirectNetezzaManager as an example.  Maybe you have to invent a new synthetic URI format,
say jdbc:mfftp:<host address>:<port>/dataset, and choose your connection manager when the
--connect option is given with a URI in that format.  That should simplify a whole lot,
in my opinion.  What do you think?
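For illustration, here is a minimal sketch of how dispatching on such a synthetic URI scheme could look.  The class and method names (MfftpUriSketch, managerFor, datasetOf) and the returned manager names are hypothetical, not Sqoop's actual ManagerFactory API:

```java
// Hypothetical sketch of dispatching on a synthetic URI scheme such as
// jdbc:mfftp:<host>:<port>/<dataset>, as suggested above.  Names are
// illustrative only and do not reflect Sqoop's real ManagerFactory.
public class MfftpUriSketch {
    static final String SCHEME = "jdbc:mfftp:";

    /** Returns the manager name to use for a given --connect string. */
    static String managerFor(String connectString) {
        if (connectString != null && connectString.startsWith(SCHEME)) {
            return "MainframeManager";   // hypothetical mainframe connector
        }
        return "GenericJdbcManager";     // fall back to plain JDBC handling
    }

    /** Extracts the dataset name from the synthetic URI, or null if absent. */
    static String datasetOf(String connectString) {
        int slash = connectString.indexOf('/', SCHEME.length());
        return slash < 0 ? null : connectString.substring(slash + 1);
    }

    public static void main(String[] args) {
        String uri = "jdbc:mfftp:zos.example.com:21/MY.PDS.NAME";
        System.out.println(managerFor(uri));                    // MainframeManager
        System.out.println(datasetOf(uri));                     // MY.PDS.NAME
        System.out.println(managerFor("jdbc:mysql://db/test")); // GenericJdbcManager
    }
}
```

The point of the scheme prefix is that existing JDBC connect strings fall through untouched, so the dispatch is backward compatible.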
> Mariappan Asokan wrote:
>     Thanks for your suggestions.  Sorry I did not get back sooner.  In Sqoop 1.x, there
is a strong assumption that the input source is always a database table.  Because of this,
the sqoop import tool has many options that are relevant only to a source database table.  A
mainframe source is totally different from a database table.  I think it is better to create
a separate tool for mainframe import rather than just a new connection manager.  The mainframe
import tool will not support many options that the database import tool supports, and it will
have its own options that the database import tool does not.  At present, these are the host
name and the partitioned dataset name.  In the future, the mainframe import tool may be enhanced
with metadata-specific or connection-specific arguments unique to the mainframe.  Creating a
synthetic URI for a connection seems somewhat artificial to me.
>     Contrary to what I stated before, and considering possible future enhancements, I think
it is better to retain the support for parsing the input format, output format, Hive, HBase,
HCatalog, and codegen options.  The documentation will be enhanced in the future to reflect
this support.
> Venkat Ranganathan wrote:
>     Thanks for your thoughts on the suggestion.  As you correctly pointed out, Sqoop
1.x has a JDBC model (which is why you had to implement a ConnectionManager and provide pseudo
values for column types, always returning VARCHAR).  I understand there will be options that
mainframe import will not support (much as there are MySQL-specific, Netezza-specific, or
SQL Server-specific options).  I understand you want to have specific metadata for mainframe
import.  That may be tricky.  Connection-specific arguments can be implemented the same way
JDBC connection-specific arguments are done.
>     The reason for my suggestion was primarily to piggyback on the implementation for
imports into Hive/HBase in the future, when you have the ability to provide specific metadata
on the data.
>     You can definitely parse the various options, but you have to explicitly check and
exit if unsupported options are used.
>     My only worry with this tool is that it may be a one-off for mainframe imports alone.
We will be starting off with HDFS import only until you get to the rest of the parts, and
by the time we finally see this, it will basically be duplicating some of the code and may
be difficult to maintain.
> Gwen Shapira wrote:
>     I just checked the possibility of adding non-JDBC imports as part of the import tool,
using a fake connection URL as you suggested.
>     This is not feasible: ConnManager (which you need to inherit from) has to implement
getConnection, which returns a java.sql.Connection.  You can't return this connection object
for an FTP transport.  The same goes for readTable, which must return a ResultSet.
>     I think a separate tool is the only way to go.
> Gwen Shapira wrote:
>     Never mind :)
>     I missed the fact that the Mainframe tool actually extends ConnManager anyways.
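The two constraints discussed above (pseudo VARCHAR column types, and no real JDBC connection behind an FTP transport) can be sketched in a standalone class.  This is a hedged illustration of the shape of the problem, not Sqoop's actual MainframeManager, and the column name DEFAULT_COLUMN is an assumption made for the example:

```java
import java.sql.Types;
import java.util.Collections;
import java.util.Map;

// Standalone illustration (not Sqoop's real class hierarchy): a non-JDBC
// manager can still satisfy a JDBC-shaped interface by reporting every
// mainframe record as a single VARCHAR column and by never handing out
// a live java.sql.Connection.
public class MainframeManagerSketch {
    static final String DEFAULT_COLUMN = "DEFAULT_COLUMN";  // hypothetical name

    /** One pseudo-column per dataset, always typed as java.sql.Types.VARCHAR. */
    Map<String, Integer> getColumnTypes(String dataset) {
        return Collections.singletonMap(DEFAULT_COLUMN, Types.VARCHAR);
    }

    /** FTP transport means there is no JDBC connection to hand out. */
    java.sql.Connection getConnection() {
        return null;  // records are fetched over FTP by the record reader instead
    }

    public static void main(String[] args) {
        MainframeManagerSketch m = new MainframeManagerSketch();
        System.out.println(m.getColumnTypes("MY.PDS.NAME"));  // {DEFAULT_COLUMN=12}
        System.out.println(m.getConnection());                // null
    }
}
```

Downstream code that expects real per-column metadata (Hive schema generation, for instance) would see only this one opaque column, which is exactly why the usability of Hive/HBase targets is limited here.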

Thanks for all your comments.  I have listed the pros and cons of a separate mainframe
import tool below.  I would like to get the opinions of the Sqoop committers and go with the
majority decision.  If the decision is "no new import tool", I will make the necessary changes
in the code and documentation and upload a new patch.


Pros:

A mainframe source is entirely different from a database table.  Several of the database-related
options (--boundary-query, --columns, --direct, --fetch-size, --inline-lob-limit, --null-string,
--null-non-string, --query, --split-by, and --table) are not meaningful and will not be supported.
This makes the documentation easier for users to understand.  In the implementation, the
options are validated syntactically rather than semantically.  The mainframe host name can
be specified as an argument to the --connect option; there is no synthetic JDBC-style URI.
Enhanced implementations of MainframeConnectionManager that support mainframe record layouts
or special connection methods can add extra arguments after "--".
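The convention of passing connector-specific arguments after a bare "--" can be sketched as follows.  This is an illustrative parser, not Sqoop's actual option handling, and the flag --ftp-mode in the usage example is hypothetical:

```java
import java.util.Arrays;

// Illustrative sketch (not Sqoop's actual parser) of the convention above:
// arguments after a bare "--" are passed through, unparsed, to the
// connection manager as connector-specific extra arguments.
public class ExtraArgsSketch {
    /** Returns everything after the first bare "--", or an empty array. */
    static String[] extraArgs(String[] argv) {
        for (int i = 0; i < argv.length; i++) {
            if ("--".equals(argv[i])) {
                return Arrays.copyOfRange(argv, i + 1, argv.length);
            }
        }
        return new String[0];
    }

    public static void main(String[] args) {
        String[] argv = {"--connect", "zos.example.com", "--dataset", "MY.PDS.NAME",
                         "--", "--ftp-mode", "block"};  // --ftp-mode is hypothetical
        System.out.println(Arrays.toString(extraArgs(argv)));  // [--ftp-mode, block]
    }
}
```

Because everything after "--" is opaque to the tool itself, a future enhanced connection manager can add new arguments without any change to the tool's own option parsing.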


Cons:

There is some code duplication in processing the options for the import targets.

- Mariappan

On June 14, 2014, 10:46 p.m., Mariappan Asokan wrote:
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22516/
> -----------------------------------------------------------
> (Updated June 14, 2014, 10:46 p.m.)
> Review request for Sqoop.
> Repository: sqoop-trunk
> Description
> -------
> This is to move mainframe datasets to Hadoop.
> Diffs
> -----
>   src/java/org/apache/sqoop/manager/MainframeManager.java PRE-CREATION
>   src/java/org/apache/sqoop/mapreduce/MainframeDatasetFTPRecordReader.java PRE-CREATION
>   src/java/org/apache/sqoop/mapreduce/MainframeDatasetImportMapper.java PRE-CREATION
>   src/java/org/apache/sqoop/mapreduce/MainframeDatasetInputFormat.java PRE-CREATION
>   src/java/org/apache/sqoop/mapreduce/MainframeDatasetInputSplit.java PRE-CREATION
>   src/java/org/apache/sqoop/mapreduce/MainframeDatasetRecordReader.java PRE-CREATION
>   src/java/org/apache/sqoop/mapreduce/MainframeImportJob.java PRE-CREATION
>   src/java/org/apache/sqoop/tool/MainframeImportTool.java PRE-CREATION
>   src/java/org/apache/sqoop/tool/SqoopTool.java dbe429a
>   src/java/org/apache/sqoop/util/MainframeFTPClientUtils.java PRE-CREATION
>   src/test/org/apache/sqoop/manager/TestMainframeManager.java PRE-CREATION
>   src/test/org/apache/sqoop/mapreduce/TestMainframeDatasetFTPRecordReader.java PRE-CREATION
>   src/test/org/apache/sqoop/mapreduce/TestMainframeDatasetInputFormat.java PRE-CREATION
>   src/test/org/apache/sqoop/mapreduce/TestMainframeDatasetInputSplit.java PRE-CREATION
>   src/test/org/apache/sqoop/mapreduce/TestMainframeImportJob.java PRE-CREATION
>   src/test/org/apache/sqoop/tool/TestMainframeImportTool.java PRE-CREATION
>   src/test/org/apache/sqoop/util/TestMainframeFTPClientUtils.java PRE-CREATION
> Diff: https://reviews.apache.org/r/22516/diff/
> Testing
> -------
> Thanks,
> Mariappan Asokan
