Sqoop User Guide (v1.4.3) — websites/staging/sqoop/trunk/content/docs/1.4.3/SqoopUserGuide.html (staged Fri Mar 8 16:45:37 2013)
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.

Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

This document describes how to get started using Sqoop to move data between databases and Hadoop and provides reference information for the operation of the Sqoop command-line tool suite. This document is intended for:
Sqoop is an open source software product of the Apache Software Foundation.
Software development for Sqoop occurs at http://sqoop.apache.org. At that site you can obtain:
The following prerequisite knowledge is required for this product:
Before you can use Sqoop, a release of Hadoop must be installed and configured. Sqoop currently supports four major Hadoop release lines: 0.20, 0.23, 1.0, and 2.0.
This document assumes you are using a Linux or Linux-like environment. If you are using Windows, you may be able to use Cygwin to accomplish most of the following tasks. If you are using Mac OS X, you should see few (if any) compatibility errors. Sqoop is predominantly operated and tested on Linux.
With Sqoop, you can import data from a relational database system into HDFS. The input to the import process is a database table. Sqoop will read the table row-by-row into HDFS. The output of this import process is a set of files containing a copy of the imported table. The import process is performed in parallel. For this reason, the output will be in multiple files. These files may be delimited text files (for example, with commas or tabs separating each field), or binary Avro or SequenceFiles containing serialized record data.

A by-product of the import process is a generated Java class which can encapsulate one row of the imported table. This class is used during the import process by Sqoop itself. The Java source code for this class is also provided to you, for use in subsequent MapReduce processing of the data. This class can serialize and deserialize data to and from the SequenceFile format. It can also parse the delimited-text form of a record. These abilities allow you to quickly develop MapReduce applications that use the HDFS-stored records in your processing pipeline. You are also free to parse the delimited record data yourself, using any other tools you prefer.

After manipulating the imported records (for example, with MapReduce or Hive) you may have a result data set which you can then export back to the relational database. Sqoop's export process will read a set of delimited text files from HDFS in parallel, parse them into records, and insert them as new rows in a target database table, for consumption by external applications or users.
Sqoop includes some other commands which allow you to inspect the database you are working with. For example, you can list the available database schemas (with the sqoop-list-databases tool) and tables within a schema (with the sqoop-list-tables tool). Sqoop also includes a primitive SQL execution shell (the sqoop-eval tool).
Most aspects of the import, code generation, and export processes can be customized. You can control the specific row range or columns imported. You can specify particular delimiters and escape characters for the file-based representation of the data, as well as the file format used. You can also control the class or package names used in generated code. Subsequent sections of this document explain how to specify these and other arguments to Sqoop.

Sqoop is a collection of related tools. To use Sqoop, you specify the tool you want to use and the arguments that control the tool.
If Sqoop is compiled from its own source, you can run Sqoop without a formal installation process by running the bin/sqoop program. Users of a packaged deployment of Sqoop (such as an RPM shipped with Apache Bigtop) will see this program installed as /usr/bin/sqoop. The remainder of this documentation will refer to this program as sqoop. For example:

$ sqoop tool-name [tool-arguments]

Note: the following examples that begin with a $ character indicate that the commands must be entered at a terminal prompt (such as bash).
Sqoop ships with a help tool. To display a list of all available tools, type the following command:

$ sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.
You can display help for a specific tool by entering: sqoop help (tool-name); for example, sqoop help import.

You can also add the --help argument to any command: sqoop import --help.

In addition to typing the sqoop (toolname) syntax, you can use alias scripts that specify the sqoop-(toolname) syntax. For example, the scripts sqoop-import, sqoop-export, etc. each select a specific tool.
You invoke Sqoop through the program launch capability provided by Hadoop. The sqoop command-line program is a wrapper which runs the bin/hadoop script shipped with Hadoop. If you have multiple installations of Hadoop present on your machine, you can select the Hadoop installation by setting the $HADOOP_COMMON_HOME and $HADOOP_MAPRED_HOME environment variables.

For example:

$ HADOOP_COMMON_HOME=/path/to/some/hadoop \
  HADOOP_MAPRED_HOME=/path/to/some/hadoop-mapreduce \
  sqoop import --arguments...

or:

$ export HADOOP_COMMON_HOME=/some/path/to/hadoop
$ export HADOOP_MAPRED_HOME=/some/path/to/hadoop-mapreduce
$ sqoop import --arguments...
If either of these variables is not set, Sqoop will fall back to $HADOOP_HOME. If it is not set either, Sqoop will use the default installation locations for Apache Bigtop, /usr/lib/hadoop and /usr/lib/hadoop-mapreduce, respectively.

The active Hadoop configuration is loaded from $HADOOP_HOME/conf/, unless the $HADOOP_CONF_DIR environment variable is set.
To control the operation of each Sqoop tool, you use generic and +specific arguments.
$ sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
   --connect <jdbc-uri>          Specify JDBC connect string
   --connect-manager <jdbc-uri>  Specify connection manager class to use
   --driver <class-name>         Manually specify JDBC driver class to use
   --hadoop-mapred-home <dir>    Override $HADOOP_MAPRED_HOME
   --help                        Print usage instructions
   -P                            Read password from console
   --password <password>         Set authentication password
   --username <username>         Set authentication username
   --verbose                     Print more information while working
   --hadoop-home <dir>           Deprecated. Override $HADOOP_HOME

[...]

Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
You must supply the generic arguments -conf, -D, and so on after the tool name but before any tool-specific arguments (such as --connect). Note that generic Hadoop arguments are preceded by a single dash character (-), whereas tool-specific arguments start with two dashes (--), unless they are single character arguments such as -P.

The -conf, -D, -fs and -jt arguments control the configuration and Hadoop server settings. For example, -D mapred.job.name=<job_name> can be used to set the name of the MR job that Sqoop launches; if not specified, the name defaults to the jar name for the job, which is derived from the used table name.

The -files, -libjars, and -archives arguments are not typically used with Sqoop, but they are included as part of Hadoop's internal argument-parsing system.
When using Sqoop, the command line options that do not change from invocation to invocation can be put in an options file for convenience. An options file is a text file where each line identifies an option in the order that it would otherwise appear on the command line. Options files allow specifying a single option on multiple lines by using the backslash character at the end of intermediate lines. Comments are also supported: they begin with the hash character, must be specified on a new line, and may not be mixed with option text. All comments and empty lines are ignored when options files are expanded. Unless an option appears as a quoted string, any leading or trailing spaces are ignored. Quoted strings, if used, must not extend beyond the line on which they are specified.
Options files can be specified anywhere on the command line as long as the options within them follow the otherwise prescribed rules of option ordering. For instance, regardless of where the options are loaded from, they must follow the ordering such that generic options appear first, tool-specific options next, finally followed by options that are intended to be passed to child programs.
To specify an options file, simply create an options file in a convenient location and pass it to the command line via the --options-file argument.

Whenever an options file is specified, it is expanded on the command line before the tool is invoked. You can specify more than one options file within the same invocation if needed.
For example, the following Sqoop invocation for import can be specified alternatively as shown below:

$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST

$ sqoop --options-file /users/homer/work/import.txt --table TEST
where the options file /users/homer/work/import.txt contains the following:

import
--connect
jdbc:mysql://localhost/db
--username
foo
The options file can have empty lines and comments for readability purposes. So the above example would work exactly the same if the options file /users/homer/work/import.txt contained the following:

#
# Options file for Sqoop import
#

# Specifies the tool being invoked
import

# Connect parameter and value
--connect
jdbc:mysql://localhost/db

# Username parameter and value
--username
foo

#
# Remaining options should be specified in the command line.
#
The import tool imports an individual table from an RDBMS to HDFS. Each row from a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation as Avro or SequenceFiles.
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
While the Hadoop generic arguments must precede any import arguments, +you can type the import arguments in any order with respect to one +another.
In this document, arguments are grouped into collections +organized by function. Some collections are present in several tools +(for example, the "common" arguments). An extended description of their +functionality is given only on the first presentation in this +document.
Table 1. Common arguments

| Argument | Description |
|---|---|
| --connect <jdbc-uri> | Specify JDBC connect string |
| --connection-manager <class-name> | Specify connection manager class to use |
| --driver <class-name> | Manually specify JDBC driver class to use |
| --hadoop-mapred-home <dir> | Override $HADOOP_MAPRED_HOME |
| --help | Print usage instructions |
| -P | Read password from console |
| --password <password> | Set authentication password |
| --username <username> | Set authentication username |
| --verbose | Print more information while working |
| --connection-param-file <filename> | Optional properties file that provides connection parameters |
Sqoop is designed to import tables from a database into HDFS. To do so, you must specify a connect string that describes how to connect to the database. The connect string is similar to a URL, and is communicated to Sqoop with the --connect argument. This describes the server and database to connect to; it may also specify the port. For example:

$ sqoop import --connect jdbc:mysql://database.example.com/employees

This string will connect to a MySQL database named employees on the host database.example.com. It's important that you do not use the URL localhost if you intend to use Sqoop with a distributed Hadoop cluster. The connect string you supply will be used on TaskTracker nodes throughout your MapReduce cluster; if you specify the literal name localhost, each node will connect to a different database (or more likely, no database at all). Instead, you should use the full hostname or IP address of the database host that can be seen by all your remote nodes.
You might need to authenticate against the database before you can access it. You can use the --username and --password or -P parameters to supply a username and a password to the database. For example:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --username aaron --password 12345
Sqoop automatically supports several databases, including MySQL. Connect strings beginning with jdbc:mysql:// are handled automatically in Sqoop. (A full list of databases with built-in support is provided in the "Supported Databases" section. For some, you may need to install the JDBC driver yourself.)
You can use Sqoop with any other JDBC-compliant database. First, download the appropriate JDBC driver for the type of database you want to import, and install the .jar file in the $SQOOP_HOME/lib directory on your client machine. (This will be /usr/lib/sqoop/lib if you installed from an RPM or Debian package.) Each driver .jar file also has a specific driver class which defines the entry-point to the driver. For example, MySQL's Connector/J library has a driver class of com.mysql.jdbc.Driver. Refer to your database vendor-specific documentation to determine the main driver class. This class must be provided as an argument to Sqoop with --driver.
For example, to connect to a SQLServer database, first download the driver from +microsoft.com and install it in your Sqoop lib path.
Then run Sqoop. For example:
$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
    --connect <connect-string> ...
When connecting to a database using JDBC, you can optionally specify extra JDBC parameters via a property file using the option --connection-param-file. The contents of this file are parsed as standard Java properties and passed into the driver while creating a connection.
The parameters specified via the optional property file are only +applicable to JDBC connections. Any fastpath connectors that use connections +other than JDBC will ignore these parameters.
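For illustration, a minimal properties file might look like the following (the file name is arbitrary, and the property keys shown are MySQL Connector/J examples, not something Sqoop itself prescribes):

```
# conn-params.properties -- parsed as standard Java properties and handed
# to the JDBC driver when the connection is created. Property names are
# driver-specific; these Connector/J keys are shown only as examples.
useUnicode=true
characterEncoding=UTF-8
```

It would then be referenced on the command line as --connection-param-file conn-params.properties.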
Table 2. Validation arguments (more details in the "Validation" section)

| Argument | Description |
|---|---|
| --validate | Enable validation of data copied; supports single table copy only |
| --validation-threshold <class-name> | Specify validation threshold class to use |
| --validation-failurehandler <class-name> | Specify validation failure handler class to use |
Table 3. Import control arguments:

| Argument | Description |
|---|---|
| --append | Append data to an existing dataset in HDFS |
| --as-avrodatafile | Imports data to Avro Data Files |
| --as-sequencefile | Imports data to SequenceFiles |
| --as-textfile | Imports data as plain text (default) |
| --boundary-query <statement> | Boundary query to use for creating splits |
| --columns <col,col,col…> | Columns to import from table |
| --direct | Use direct import fast path |
| --direct-split-size <n> | Split the input stream every n bytes when importing in direct mode |
| --fetch-size <n> | Number of entries to read from database at once |
| --inline-lob-limit <n> | Set the maximum size for an inline LOB |
| -m,--num-mappers <n> | Use n map tasks to import in parallel |
| -e,--query <statement> | Import the results of statement |
| --split-by <column-name> | Column of the table used to split work units |
| --table <table-name> | Table to read |
| --target-dir <dir> | HDFS destination dir |
| --warehouse-dir <dir> | HDFS parent for table destination |
| --where <where clause> | WHERE clause to use during import |
| -z,--compress | Enable compression |
| --compression-codec <c> | Use Hadoop codec (default gzip) |
| --null-string <null-string> | The string to be written for a null value for string columns |
| --null-non-string <null-string> | The string to be written for a null value for non-string columns |

The --null-string and --null-non-string arguments are optional. If not specified, then the string "null" will be used.
Sqoop typically imports data in a table-centric fashion. Use the --table argument to select the table to import. For example, --table employees. This argument can also identify a VIEW or other table-like entity in a database.
By default, all columns within a table are selected for import. Imported data is written to HDFS in its "natural order;" that is, a table containing columns A, B, and C results in an import of data such as:

A1,B1,C1
A2,B2,C2
...
You can select a subset of columns and control their ordering by using the --columns argument. This should include a comma-delimited list of columns to import. For example: --columns "name,employee_id,jobtitle".
You can control which rows are imported by adding a SQL WHERE clause to the import statement. By default, Sqoop generates statements of the form SELECT <column list> FROM <table name>. You can append a WHERE clause to this with the --where argument. For example: --where "id > 400". Only rows where the id column has a value greater than 400 will be imported.
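Putting it together, a complete invocation might look like the following (the connect string and table name are illustrative):

```
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --where "id > 400"
```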
By default, Sqoop will use the query select min(<split-by>), max(<split-by>) from <table name> to find out boundaries for creating splits. In some cases this query is not the most optimal, so you can specify any arbitrary query returning two numeric columns using the --boundary-query argument.
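A sketch of supplying custom split boundaries (the query, table, and column names are illustrative; the query must return exactly two numeric columns):

```
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --split-by id \
    --boundary-query "SELECT MIN(id), MAX(id) FROM employees WHERE id > 0"
```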
Sqoop can also import the result set of an arbitrary SQL query. Instead of using the --table, --columns and --where arguments, you can specify a SQL statement with the --query argument.

When importing a free-form query, you must specify a destination directory with --target-dir.

If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS, which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by. For example:

$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  --split-by a.id --target-dir /user/foo/joinresults

Alternately, the query can be executed once and imported serially, by specifying a single map task with -m 1:

$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  -m 1 --target-dir /user/foo/joinresults

If you are issuing the query wrapped with double quotes ("), you will have to use \$CONDITIONS instead of $CONDITIONS to disallow your shell from treating it as a shell variable.

Note: the facility of using a free-form query in the current version of Sqoop is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause.
Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism greater than that available within your MapReduce cluster; tasks will run serially and will likely increase the amount of time required to perform the import. Likewise, do not increase the degree of parallelism higher than that which your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.
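For example, to double the default parallelism (connect string and table name are illustrative):

```
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --num-mappers 8
```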
When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.
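The split arithmetic in this example can be sketched in plain shell. This is only an illustration of the arithmetic using the min/max/task values from the paragraph above, not Sqoop's actual implementation:

```shell
# Evenly slice the range [0, 1000] of the split column across 4 tasks.
min=0; max=1000; tasks=4
step=$(( (max - min) / tasks ))          # 250 rows of range per task
ranges=""
for i in 0 1 2 3; do
  lo=$(( min + i * step ))
  if [ "$i" -eq 3 ]; then
    hi=$(( max + 1 ))                    # last split must include max itself
  else
    hi=$(( lo + step ))
  fi
  ranges="$ranges($lo,$hi)"              # each task runs: id >= lo AND id < hi
done
echo "$ranges"
```

Each (lo,hi) pair corresponds to one task's WHERE clause bounds.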
If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument. For example, --split-by employee_id. Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
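For example (connect string, table, and column name are illustrative):

```
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --split-by employee_id
```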
By default, the import process will use JDBC, which provides a reasonable cross-vendor import channel. Some databases can perform imports in a more high-performance fashion by using database-specific data movement tools. For example, MySQL provides the mysqldump tool which can export data from MySQL to other systems very quickly. By supplying the --direct argument, you are specifying that Sqoop should attempt the direct import channel. This channel may be higher performance than using JDBC. Currently, direct mode does not support imports of large object columns.
When importing from PostgreSQL in conjunction with direct mode, you can split the import into separate files after individual files reach a certain size. This size limit is controlled with the --direct-split-size argument.
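A hedged sketch (the PostgreSQL connect string, table name, and the 512 MB split size of 536870912 bytes are all illustrative):

```
$ sqoop import --connect jdbc:postgresql://database.example.com/employees \
    --table employees --direct --direct-split-size 536870912
```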
By default, Sqoop will import a table named foo to a directory named foo inside your home directory in HDFS. For example, if your username is someuser, then the import tool will write to /user/someuser/foo/(files). You can adjust the parent directory of the import with the --warehouse-dir argument. For example:

$ sqoop import --connect <connect-str> --table foo --warehouse-dir /shared \
    ...

This command would write to a set of files in the /shared/foo/ directory.

You can also explicitly choose the target directory, like so:

$ sqoop import --connect <connect-str> --table foo --target-dir /dest \
    ...

This will import the files into the /dest directory.
When using direct mode, you can specify additional arguments which should be passed to the underlying tool. If the argument -- is given on the command-line, then subsequent arguments are sent directly to the underlying tool. For example, the following adjusts the character set used by mysqldump:

$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
    --direct -- --default-character-set=latin1
By default, imports go to a new target location. If the destination directory already exists in HDFS, Sqoop will refuse to import and overwrite that directory's contents. If you use the --append argument, Sqoop will import data to a temporary directory and then rename the files into the normal target directory in a manner that does not conflict with existing filenames in that directory.
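For example, to add the rows matching a new predicate to a directory populated by an earlier import (paths, table, and predicate are illustrative):

```
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --where "id > 400" \
    --target-dir /user/foo/employees --append
```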
When using the direct mode of import, certain database client utilities are expected to be present in the shell path of the task process. For MySQL the utilities mysqldump and mysqlimport are required.
Sqoop is preconfigured to map most SQL types to appropriate Java or Hive representatives. However, the default mapping might not be suitable for everyone and can be overridden by --map-column-java (for changing the mapping to Java) or --map-column-hive (for changing the Hive mapping).
Table 4. Parameters for overriding mapping

| Argument | Description |
|---|---|
| --map-column-java <mapping> | Override mapping from SQL to Java type for configured columns |
| --map-column-hive <mapping> | Override mapping from SQL to Hive type for configured columns |
Sqoop expects a comma-separated list of mappings in the form <name of column>=<new type>. For example:
$ sqoop import ... --map-column-java id=String,value=Integer
Sqoop will raise an exception if a configured mapping is not used.
Sqoop provides an incremental import mode which can be used to retrieve +only rows newer than some previously-imported set of rows.
The following arguments control incremental imports:
Table 5. Incremental import arguments:

| Argument | Description |
|---|---|
| --check-column (col) | Specifies the column to be examined when determining which rows to import |
| --incremental (mode) | Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified |
| --last-value (value) | Specifies the maximum value of the check column from the previous import |
Sqoop supports two types of incremental imports: append and lastmodified. You can use the --incremental argument to specify the type of incremental import to perform.
You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row's id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.
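For example, an append-mode incremental import might be invoked as follows (connect string, table, and the last-value of 400 are illustrative):

```
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --incremental append \
    --check-column id --last-value 400
```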
An alternate table update strategy supported by Sqoop is called lastmodified mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.
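A sketch of a lastmodified-mode import (the column name last_update_date and the timestamp value are hypothetical):

```
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --incremental lastmodified \
    --check-column last_update_date --last-value "2013-03-01 00:00:00"
```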
At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating the incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information.
You can import data in one of two file formats: delimited text or SequenceFiles.
Delimited text is the default import format. You can also specify it explicitly by using the --as-textfile argument. This argument will write string-based representations of each record to the output files, with delimiter characters between individual columns and rows. These delimiters may be commas, tabs, or other characters. (The delimiters can be selected; see "Output line formatting arguments.") The following is the result of an example text-based import:

1,here is a message,2010-05-01
2,happy new year!,2010-01-01
3,another message,2009-11-12
Delimited text is appropriate for most non-binary data types. It also +readily supports further manipulation by other tools, such as Hive.
SequenceFiles are a binary format that stores individual records in custom record-specific data types. These data types are manifested as Java classes. Sqoop will automatically generate these data types for you. This format supports exact storage of all data in binary representations, and is appropriate for storing binary data (for example, VARBINARY columns), or data that will be principally manipulated by custom MapReduce programs (reading from SequenceFiles is higher-performance than reading from text files, as records do not need to be parsed).
Avro data files are a compact, efficient binary format that provides +interoperability with applications written in other programming +languages. Avro also supports versioning, so that when, e.g., columns +are added or removed from a table, previously imported data files can +be processed along with new ones.
By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument. This applies to SequenceFile, text, and Avro files.
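For example (the connect string and table are illustrative; BZip2Codec is one of the standard Hadoop codec classes):

```
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees -z

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees \
    --compression-codec org.apache.hadoop.io.compress.BZip2Codec
```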
Sqoop handles large objects (BLOB and CLOB columns) in particular ways. If this data is truly large, then these columns should not be fully materialized in memory for manipulation, as most columns are. Instead, their data is handled in a streaming fashion. Large objects can be stored inline with the rest of the data, in which case they are fully materialized in memory on every access, or they can be stored in a secondary storage file linked to the primary data storage. By default, large objects less than 16 MB in size are stored inline with the rest of the data. At a larger size, they are stored in files in the _lobs subdirectory of the import target directory. These files are stored in a separate format optimized for large record storage, which can accommodate records of up to 2^63 bytes each. The size at which lobs spill into separate files is controlled by the --inline-lob-limit argument, which takes a parameter specifying the largest lob size to keep inline, in bytes. If you set the inline LOB limit to 0, all large objects will be placed in external storage.
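For example, to keep only LOBs smaller than 1 MB (1048576 bytes) inline and spill anything larger to the _lobs subdirectory (connect string and table are illustrative):

```
$ sqoop import --connect jdbc:mysql://database.example.com/documents \
    --table documents --inline-lob-limit 1048576
```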
Table 6. Output line formatting arguments:

| Argument | Description |
|---|---|
| --enclosed-by <char> | Sets a required field enclosing character |
| --escaped-by <char> | Sets the escape character |
| --fields-terminated-by <char> | Sets the field separator character |
| --lines-terminated-by <char> | Sets the end-of-line character |
| --mysql-delimiters | Uses MySQL's default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: ' |
| --optionally-enclosed-by <char> | Sets a field enclosing character |
When importing to delimited files, the choice of delimiter is important. Delimiters which appear inside string-based fields may cause ambiguous parsing of the imported data by subsequent analysis passes. For example, the string "Hello, pleased to meet you" should not be imported with the end-of-field delimiter set to a comma.
Delimiters may be specified as:

- a character (--fields-terminated-by X)
- an escape character (--fields-terminated-by \t). Supported escape characters are:
  - \b (backspace)
  - \n (newline)
  - \r (carriage return)
  - \t (tab)
  - \" (double-quote)
  - \' (single-quote)
  - \\ (backslash)
  - \0 (NUL) - This will insert NUL characters between fields or lines, or will disable enclosing/escaping if used for one of the --enclosed-by, --optionally-enclosed-by, or --escaped-by arguments.
- The octal representation of a UTF-8 character's code point, \0ooo, where ooo is the octal value. For example, --fields-terminated-by \001 would yield the ^A character.
- The hexadecimal representation of a UTF-8 character's code point, \0xhhh, where hhh is the hex value. For example, --fields-terminated-by \0x10 would yield the carriage return character.
The default delimiters are a comma (,) for fields, a newline (\n) for records, no quote character, and no escape character. Note that this can lead to ambiguous or unparsable records if you import database records containing commas or newlines in the field data. For unambiguous parsing, both must be enabled. For example, via --mysql-delimiters.
If unambiguous delimiters cannot be presented, then use enclosing and +escaping characters. The combination of (optional) +enclosing and escaping characters will allow unambiguous parsing of +lines. For example, suppose one column of a dataset contained the +following values:
Some string, with a comma.
Another "string with quotes"
The following arguments would provide delimiters which can be +unambiguously parsed:
$ sqoop import --fields-terminated-by , --escaped-by \\ --enclosed-by '\"' ...
(Note that to prevent the shell from mangling the enclosing character, +we have enclosed that argument itself in single-quotes.)
The result of the above arguments applied to the above dataset would +be:
"Some string, with a comma.","1","2","3"... +"Another \"string with quotes\"","4","5","6"...
Here the imported strings are shown in the context of additional columns ("1","2","3", etc.) to demonstrate the full effect of enclosing and escaping. The enclosing character is only strictly necessary when delimiter characters appear in the imported text. The enclosing character can therefore be specified as optional:
$ sqoop import --optionally-enclosed-by '\"' (the rest as above)...
Which would result in the following import:
"Some string, with a comma.",1,2,3... +"Another \"string with quotes\"",4,5,6...
Even though Hive supports escaping characters, it does not handle escaping of the new-line character. Also, it does not support the notion of enclosing characters that may include field delimiters in the enclosed string. It is therefore recommended that you choose unambiguous field- and record-terminating delimiters without the help of escaping and enclosing characters when working with Hive; this is due to limitations of Hive's input parsing abilities.
The --mysql-delimiters argument is a shorthand argument which uses the default delimiters for the mysqldump program. If you use the mysqldump delimiters in conjunction with a direct-mode import (with --direct), very fast imports can be achieved.
While the choice of delimiters is most important for a text-mode import, it is still relevant if you import to SequenceFiles with --as-sequencefile. The generated class' toString() method will use the delimiters you specify, so subsequent formatting of the output data will rely on the delimiters you choose.
Table 7. Input parsing arguments:

| Argument | Description |
|---|---|
| --input-enclosed-by <char> | Sets a required field encloser |
| --input-escaped-by <char> | Sets the input escape character |
| --input-fields-terminated-by <char> | Sets the input field separator |
| --input-lines-terminated-by <char> | Sets the input end-of-line character |
| --input-optionally-enclosed-by <char> | Sets a field enclosing character |
When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates when doing a delimited-format import. The delimiters are chosen with arguments such as --fields-terminated-by; this controls both how the data is written to disk, and how the generated parse() method reinterprets this data. The delimiters used by the parse() method can be chosen independently of the output arguments, by using --input-fields-terminated-by, and so on. This is useful, for example, to generate classes which can parse records created with one set of delimiters, and emit the records to a different set of files using a separate set of delimiters.
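For example, one might generate a class that both writes and parses pipe-separated records (connect string, table, and the choice of delimiter are illustrative):

```
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees \
    --fields-terminated-by '|' --input-fields-terminated-by '|'
```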
Table 8. Hive arguments:

| Argument | Description |
|---|---|
| --hive-home <dir> | Override $HIVE_HOME |
| --hive-import | Import tables into Hive (uses Hive's default delimiters if none are set) |
| --hive-overwrite | Overwrite existing data in the Hive table |
| --create-hive-table | If set, then the job will fail if the target hive table exists. By default this property is false |
| --hive-table <table-name> | Sets the table name to use when importing to Hive |
| --hive-drop-import-delims | Drops \n, \r, and \01 from string fields when importing to Hive |

[... 1905 lines stripped ...]