Added: incubator/sqoop/site/src/site/resources/docs/1.4.1-incubating/SqoopUserGuide.html URL: http://svn.apache.org/viewvc/incubator/sqoop/site/src/site/resources/docs/1.4.1-incubating/SqoopUserGuide.html?rev=1245310&view=auto ============================================================================== --- incubator/sqoop/site/src/site/resources/docs/1.4.1-incubating/SqoopUserGuide.html (added) +++ incubator/sqoop/site/src/site/resources/docs/1.4.1-incubating/SqoopUserGuide.html Fri Feb 17 03:19:16 2012 @@ -0,0 +1,2368 @@ +
Table of Contents
sqoop-import
sqoop-import-all-tables
sqoop-export
sqoop-job
sqoop-metastore
sqoop-merge
sqoop-codegen
sqoop-create-hive-table
sqoop-eval
sqoop-list-databases
sqoop-list-tables
sqoop-help
sqoop-version
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.

Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

This document describes how to get started using Sqoop to move data between databases and Hadoop and provides reference information for the operation of the Sqoop command-line tool suite. This document is intended for:
Sqoop is an open source software product of the Apache Software Foundation.
Software development for Sqoop occurs at http://svn.apache.org/repos/asf/incubator/sqoop/trunk. At that site you can obtain:

Sqoop is compatible with Apache Hadoop 0.21 and Cloudera’s Distribution of Hadoop version 3.
The following prerequisite knowledge is required for this product:
familiarity with command-line interfaces such as bash
Before you can use Sqoop, a release of Hadoop must be installed and configured. We recommend that you download Cloudera’s Distribution for Hadoop (CDH3) from the Cloudera Software Archive at http://archive.cloudera.com for straightforward installation of Hadoop on Linux systems.

This document assumes you are using a Linux or Linux-like environment. If you are using Windows, you may be able to use cygwin to accomplish most of the following tasks. If you are using Mac OS X, you should see few (if any) compatibility errors. Sqoop is predominantly operated and tested on Linux.

With Sqoop, you can import data from a relational database system into HDFS. The input to the import process is a database table. Sqoop will read the table row-by-row into HDFS. The output of this import process is a set of files containing a copy of the imported table. The import process is performed in parallel. For this reason, the output will be in multiple files. These files may be delimited text files (for example, with commas or tabs separating each field), or binary Avro or SequenceFiles containing serialized record data.

A by-product of the import process is a generated Java class which can encapsulate one row of the imported table. This class is used during the import process by Sqoop itself. The Java source code for this class is also provided to you, for use in subsequent MapReduce processing of the data. This class can serialize and deserialize data to and from the SequenceFile format. It can also parse the delimited-text form of a record. These abilities allow you to quickly develop MapReduce applications that use the HDFS-stored records in your processing pipeline. You are also free to parse the delimited record data yourself, using any other tools you prefer.

After manipulating the imported records (for example, with MapReduce or Hive) you may have a result data set which you can then export back to the relational database. Sqoop’s export process will read a set of delimited text files from HDFS in parallel, parse them into records, and insert them as new rows in a target database table, for consumption by external applications or users.
Sqoop includes some other commands which allow you to inspect the database you are working with. For example, you can list the available database schemas (with the sqoop-list-databases tool) and tables within a schema (with the sqoop-list-tables tool). Sqoop also includes a primitive SQL execution shell (the sqoop-eval tool).
Most aspects of the import, code generation, and export processes can be customized. You can control the specific row range or columns imported. You can specify particular delimiters and escape characters for the file-based representation of the data, as well as the file format used. You can also control the class or package names used in generated code. Subsequent sections of this document explain how to specify these and other arguments to Sqoop.

Sqoop is a collection of related tools. To use Sqoop, you specify the tool you want to use and the arguments that control the tool.
If Sqoop is compiled from its own source, you can run Sqoop without a formal installation process by running the bin/sqoop program. Users of a packaged deployment of Sqoop (such as an RPM shipped with Cloudera’s Distribution for Hadoop) will see this program installed as /usr/bin/sqoop. The remainder of this documentation will refer to this program as sqoop. For example:
$ sqoop tool-name [tool-arguments]
Note: The following examples that begin with a $ character indicate that the commands are entered at a terminal prompt (such as bash).
Sqoop ships with a help tool. To display a list of all available tools, type the following command:

$ sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.
You can display help for a specific tool by entering: sqoop help (tool-name); for example, sqoop help import.

You can also add the --help argument to any command: sqoop import --help.
In addition to typing the sqoop (toolname) syntax, you can use alias scripts that specify the sqoop-(toolname) syntax. For example, the scripts sqoop-import, sqoop-export, etc. each select a specific tool.
You invoke Sqoop through the program launch capability provided by Hadoop. The sqoop command-line program is a wrapper which runs the bin/hadoop script shipped with Hadoop. If you have multiple installations of Hadoop present on your machine, you can select the Hadoop installation by setting the $HADOOP_HOME environment variable.
For example:
$ HADOOP_HOME=/path/to/some/hadoop sqoop import --arguments...
or:
$ export HADOOP_HOME=/some/path/to/hadoop
$ sqoop import --arguments...
If $HADOOP_HOME is not set, Sqoop will use the default installation location for Cloudera’s Distribution for Hadoop, /usr/lib/hadoop.

The active Hadoop configuration is loaded from $HADOOP_HOME/conf/, unless the $HADOOP_CONF_DIR environment variable is set.
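For example, a sketch of selecting both a Hadoop installation and an alternate configuration directory (the paths shown here are hypothetical):

$ export HADOOP_HOME=/usr/lib/hadoop-0.20
$ export HADOOP_CONF_DIR=/etc/hadoop/conf.cluster2
$ sqoop import --arguments...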
To control the operation of each Sqoop tool, you use generic and specific arguments.
For example:
$ sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
   --connect <jdbc-uri>         Specify JDBC connect string
   --connect-manager <jdbc-uri> Specify connection manager class to use
   --driver <class-name>        Manually specify JDBC driver class to use
   --hadoop-home <dir>          Override $HADOOP_HOME
   --help                       Print usage instructions
-P                              Read password from console
   --password <password>        Set authentication password
   --username <username>        Set authentication username
   --verbose                    Print more information while working

[...]

Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>          specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>         specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
You must supply the generic arguments -conf, -D, and so on after the tool name but before any tool-specific arguments (such as --connect). Note that generic Hadoop arguments are preceded by a single dash character (-), whereas tool-specific arguments start with two dashes (--), unless they are single character arguments such as -P.
The -conf, -D, -fs and -jt arguments control the configuration and Hadoop server settings. For example, -D mapred.job.name=<job_name> can be used to set the name of the MR job that Sqoop launches. If not specified, the name defaults to the jar name for the job, which is derived from the table name used.
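As a sketch, the following sets the MapReduce job name for an import; the job name, connect string, and table are hypothetical:

$ sqoop import -D mapred.job.name=nightly-employees-import \
    --connect jdbc:mysql://database.example.com/employees --table employees

Note that -D is placed after the tool name but before any tool-specific arguments, as required for generic Hadoop arguments.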
The -files, -libjars, and -archives arguments are not typically used with Sqoop, but they are included as part of Hadoop’s internal argument-parsing system.
When using Sqoop, command line options that do not change from invocation to invocation can be put in an options file for convenience. An options file is a text file where each line identifies an option in the order that it would otherwise appear on the command line. Options files allow specifying a single option on multiple lines by using the backslash character at the end of intermediate lines. Comments are also supported within options files and begin with the hash character. Comments must be specified on a new line and may not be mixed with option text. All comments and empty lines are ignored when options files are expanded. Unless options appear as quoted strings, any leading or trailing spaces are ignored. Quoted strings, if used, must not extend beyond the line on which they are specified.

Options files can be specified anywhere in the command line as long as the options within them follow the otherwise prescribed rules of option ordering. For instance, regardless of where the options are loaded from, they must follow the ordering such that generic options appear first, tool-specific options next, finally followed by options that are intended to be passed to child programs.
To specify an options file, simply create an options file in a convenient location and pass it to the command line via the --options-file argument.

Whenever an options file is specified, it is expanded on the command line before the tool is invoked. You can specify more than one options file within the same invocation if needed.
For example, the following Sqoop invocation for import can be specified alternatively as shown below:

$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST

$ sqoop --options-file /users/homer/work/import.txt --table TEST
where the options file /users/homer/work/import.txt contains the following:

import
--connect
jdbc:mysql://localhost/db
--username
foo
The options file can have empty lines and comments for readability purposes. So the above example would work exactly the same if the options file /users/homer/work/import.txt contained the following:

#
# Options file for Sqoop import
#

# Specifies the tool being invoked
import

# Connect parameter and value
--connect
jdbc:mysql://localhost/db

# Username parameter and value
--username
foo

#
# Remaining options should be specified in the command line.
#
The import tool imports an individual table from an RDBMS to HDFS. Each row from a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation as Avro or SequenceFiles.

$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
While the Hadoop generic arguments must precede any import arguments, you can type the import arguments in any order with respect to one another.
Note: In this document, arguments are grouped into collections organized by function. Some collections are present in several tools (for example, the "common" arguments). An extended description of their functionality is given only on the first presentation in this document.
Table 1. Common arguments

| Argument | Description |
|---|---|
| --connect <jdbc-uri> | Specify JDBC connect string |
| --connection-manager <class-name> | Specify connection manager class to use |
| --driver <class-name> | Manually specify JDBC driver class to use |
| --hadoop-home <dir> | Override $HADOOP_HOME |
| --help | Print usage instructions |
| -P | Read password from console |
| --password <password> | Set authentication password |
| --username <username> | Set authentication username |
| --verbose | Print more information while working |
| --connection-param-file <filename> | Optional properties file that provides connection parameters |
Sqoop is designed to import tables from a database into HDFS. To do so, you must specify a connect string that describes how to connect to the database. The connect string is similar to a URL, and is communicated to Sqoop with the --connect argument. This describes the server and database to connect to; it may also specify the port. For example:
$ sqoop import --connect jdbc:mysql://database.example.com/employees
This string will connect to a MySQL database named employees on the host database.example.com. It’s important that you do not use the URL localhost if you intend to use Sqoop with a distributed Hadoop cluster. The connect string you supply will be used on TaskTracker nodes throughout your MapReduce cluster; if you specify the literal name localhost, each node will connect to a different database (or more likely, no database at all). Instead, you should use the full hostname or IP address of the database host that can be seen by all your remote nodes.
You might need to authenticate against the database before you can access it. You can use the --username and --password or -P parameters to supply a username and a password to the database. For example:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --username aaron --password 12345
Warning: The --password parameter is insecure, as other users may be able to read your password from the command-line arguments via the output of programs such as ps. The -P argument will read a password from a console prompt, and is the preferred method of entering credentials.
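For example, a sketch of being prompted for the password at the console rather than passing it on the command line:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --username aaron -P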
Sqoop automatically supports several databases, including MySQL. Connect strings beginning with jdbc:mysql:// are handled automatically in Sqoop. (A full list of databases with built-in support is provided in the "Supported Databases" section. For some, you may need to install the JDBC driver yourself.)

You can use Sqoop with any other JDBC-compliant database. First, download the appropriate JDBC driver for the type of database you want to import, and install the .jar file in the $SQOOP_HOME/lib directory on your client machine. (This will be /usr/lib/sqoop/lib if you installed from an RPM or Debian package.) Each driver .jar file also has a specific driver class which defines the entry-point to the driver. For example, MySQL’s Connector/J library has a driver class of com.mysql.jdbc.Driver. Refer to your database vendor-specific documentation to determine the main driver class. This class must be provided as an argument to Sqoop with --driver.
For example, to connect to a SQLServer database, first download the driver from microsoft.com and install it in your Sqoop lib path.

Then run Sqoop. For example:

$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
    --connect <connect-string> ...
When connecting to a database using JDBC, you can optionally specify extra JDBC parameters via a property file using the option --connection-param-file. The contents of this file are parsed as standard Java properties and passed into the driver while creating a connection.
Note: The parameters specified via the optional property file are only applicable to JDBC connections. Any fastpath connectors that use connections other than JDBC will ignore these parameters.
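As a sketch, a hypothetical properties file and its use might look like the following; the property names are examples only and depend on your JDBC driver:

$ cat conn-params.properties
connectTimeout=10000
useCompression=true

$ sqoop import --connect jdbc:mysql://database.example.com/employees --table employees \
    --connection-param-file conn-params.properties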
Table 2. Import control arguments:

| Argument | Description |
|---|---|
| --append | Append data to an existing dataset in HDFS |
| --as-avrodatafile | Imports data to Avro Data Files |
| --as-sequencefile | Imports data to SequenceFiles |
| --as-textfile | Imports data as plain text (default) |
| --boundary-query <statement> | Boundary query to use for creating splits |
| --columns <col,col,col…> | Columns to import from table |
| --direct | Use direct import fast path |
| --direct-split-size <n> | Split the input stream every n bytes when importing in direct mode |
| --inline-lob-limit <n> | Set the maximum size for an inline LOB |
| -m,--num-mappers <n> | Use n map tasks to import in parallel |
| -e,--query <statement> | Import the results of statement |
| --split-by <column-name> | Column of the table used to split work units |
| --table <table-name> | Table to read |
| --target-dir <dir> | HDFS destination dir |
| --warehouse-dir <dir> | HDFS parent for table destination |
| --where <where clause> | WHERE clause to use during import |
| -z,--compress | Enable compression |
| --compression-codec <c> | Use Hadoop codec (default gzip) |
| --null-string <null-string> | The string to be written for a null value for string columns |
| --null-non-string <null-string> | The string to be written for a null value for non-string columns |
The --null-string and --null-non-string arguments are optional. If not specified, then the string "null" will be used.
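For example, a sketch of writing the sequence \N for both string and non-string NULL values (the connect string and table are hypothetical; the doubled backslash is there so the escape survives argument parsing, and may need adjusting for your shell):

$ sqoop import --connect jdbc:mysql://database.example.com/employees --table employees \
    --null-string '\\N' --null-non-string '\\N'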
Sqoop typically imports data in a table-centric fashion. Use the --table argument to select the table to import. For example, --table employees. This argument can also identify a VIEW or other table-like entity in a database.
By default, all columns within a table are selected for import. Imported data is written to HDFS in its "natural order;" that is, a table containing columns A, B, and C results in an import of data such as:

A1,B1,C1
A2,B2,C2
...
You can select a subset of columns and control their ordering by using the --columns argument. This should include a comma-delimited list of columns to import. For example: --columns "name,employee_id,jobtitle".
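A sketch of a complete invocation using --columns (the connect string and table are hypothetical):

$ sqoop import --connect jdbc:mysql://database.example.com/employees --table employees \
    --columns "name,employee_id,jobtitle"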
You can control which rows are imported by adding a SQL WHERE clause to the import statement. By default, Sqoop generates statements of the form SELECT <column list> FROM <table name>. You can append a WHERE clause to this with the --where argument. For example: --where "id > 400". Only rows where the id column has a value greater than 400 will be imported.
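For example, a sketch combining --table and --where (names are hypothetical):

$ sqoop import --connect jdbc:mysql://database.example.com/employees --table employees \
    --where "id > 400"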
By default, Sqoop will use the query select min(<split-by>), max(<split-by>) from <table name> to find out boundaries for creating splits. In some cases this query is not the most optimal, so you can specify any arbitrary query returning two numeric columns using the --boundary-query argument.
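A sketch of supplying a custom boundary query (the table and column names are hypothetical):

$ sqoop import --connect jdbc:mysql://database.example.com/employees --table employees \
    --split-by id --boundary-query "SELECT MIN(id), MAX(id) FROM employees WHERE id >= 0"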
Sqoop can also import the result set of an arbitrary SQL query. Instead of using the --table, --columns and --where arguments, you can specify a SQL statement with the --query argument.
When importing a free-form query, you must specify a destination directory with --target-dir.
If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS, which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.
For example:
$ sqoop import \
    --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
    --split-by a.id --target-dir /user/foo/joinresults
Alternately, the query can be executed once and imported serially, by specifying a single map task with -m 1:

$ sqoop import \
    --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
    -m 1 --target-dir /user/foo/joinresults
Note: If you are issuing the query wrapped with double quotes ("), you will have to use \$CONDITIONS instead of just $CONDITIONS to disallow your shell from treating it as a shell variable.
Note: The facility of using free-form query in the current version of Sqoop is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause. Use of complex queries, such as queries that have sub-queries or joins leading to ambiguous projections, can lead to unexpected results.
Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism greater than that available within your MapReduce cluster; tasks will run serially and will likely increase the amount of time required to perform the import. Likewise, do not increase the degree of parallelism higher than that which your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.
When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.
If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument. For example, --split-by employee_id. Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
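For example, a sketch of choosing the splitting column and the degree of parallelism explicitly (names are hypothetical):

$ sqoop import --connect jdbc:mysql://database.example.com/employees --table employees \
    --split-by employee_id --num-mappers 8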
By default, the import process will use JDBC, which provides a reasonable cross-vendor import channel. Some databases can perform imports in a more high-performance fashion by using database-specific data movement tools. For example, MySQL provides the mysqldump tool which can export data from MySQL to other systems very quickly. By supplying the --direct argument, you are specifying that Sqoop should attempt the direct import channel. This channel may be higher performance than using JDBC. Currently, direct mode does not support imports of large object columns.
When importing from PostgreSQL in conjunction with direct mode, you can split the import into separate files after individual files reach a certain size. This size limit is controlled with the --direct-split-size argument.
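A sketch of a direct-mode PostgreSQL import that starts a new file roughly every 64 MB (the connect string and table are hypothetical):

$ sqoop import --connect jdbc:postgresql://database.example.com/corp --table employees \
    --direct --direct-split-size 67108864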
By default, Sqoop will import a table named foo to a directory named foo inside your home directory in HDFS. For example, if your username is someuser, then the import tool will write to /user/someuser/foo/(files). You can adjust the parent directory of the import with the --warehouse-dir argument. For example:

$ sqoop import --connect <connect-str> --table foo --warehouse-dir /shared \
    ...
This command would write to a set of files in the /shared/foo/ directory.

You can also explicitly choose the target directory, like so:

$ sqoop import --connect <connect-str> --table foo --target-dir /dest \
    ...
This will import the files into the /dest directory. --target-dir is incompatible with --warehouse-dir.
When using direct mode, you can specify additional arguments which should be passed to the underlying tool. If the argument -- is given on the command-line, then subsequent arguments are sent directly to the underlying tool. For example, the following adjusts the character set used by mysqldump:

$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
    --direct -- --default-character-set=latin1
By default, imports go to a new target location. If the destination directory already exists in HDFS, Sqoop will refuse to import and overwrite that directory’s contents. If you use the --append argument, Sqoop will import data to a temporary directory and then rename the files into the normal target directory in a manner that does not conflict with existing filenames in that directory.
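For example, a sketch of appending a later import into an existing target directory (names are hypothetical):

$ sqoop import --connect jdbc:mysql://database.example.com/employees --table employees \
    --target-dir /data/employees --append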
Note: When using the direct mode of import, certain database client utilities are expected to be present in the shell path of the task process. For MySQL, the utilities mysqldump and mysqlimport are required; for PostgreSQL, the utility psql is required.
Sqoop is preconfigured to map most SQL types to appropriate Java or Hive representatives. However, the default mapping might not be suitable for everyone and can be overridden by --map-column-java (for changing the mapping to Java) or --map-column-hive (for changing the Hive mapping).
Table 3. Parameters for overriding mapping

| Argument | Description |
|---|---|
| --map-column-java <mapping> | Override mapping from SQL to Java type for configured columns. |
| --map-column-hive <mapping> | Override mapping from SQL to Hive type for configured columns. |
Sqoop expects a comma-separated list of mappings in the form <name of column>=<new type>. For example:
$ sqoop import ... --map-column-java id=String,value=Integer
Sqoop will raise an exception if a configured mapping is not used.
Sqoop provides an incremental import mode which can be used to retrieve only rows newer than some previously-imported set of rows.
The following arguments control incremental imports:
Table 4. Incremental import arguments:

| Argument | Description |
|---|---|
| --check-column (col) | Specifies the column to be examined when determining which rows to import. |
| --incremental (mode) | Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified. |
| --last-value (value) | Specifies the maximum value of the check column from the previous import. |
Sqoop supports two types of incremental imports: append and lastmodified. You can use the --incremental argument to specify the type of incremental import to perform.
You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row’s id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.
An alternate table update strategy supported by Sqoop is called lastmodified mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.
At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information.
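A sketch of an append-mode incremental import that retrieves only rows whose id is greater than 100 (the table and column are hypothetical):

$ sqoop import --connect jdbc:mysql://database.example.com/employees --table employees \
    --incremental append --check-column id --last-value 100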
You can import data in one of two file formats: delimited text or SequenceFiles.
Delimited text is the default import format. You can also specify it explicitly by using the --as-textfile argument. This argument will write string-based representations of each record to the output files, with delimiter characters between individual columns and rows. These delimiters may be commas, tabs, or other characters. (The delimiters can be selected; see "Output line formatting arguments.") The following is the result of an example text-based import:

1,here is a message,2010-05-01
2,happy new year!,2010-01-01
3,another message,2009-11-12
Delimited text is appropriate for most non-binary data types. It also readily supports further manipulation by other tools, such as Hive.
SequenceFiles are a binary format that store individual records in custom record-specific data types. These data types are manifested as Java classes. Sqoop will automatically generate these data types for you. This format supports exact storage of all data in binary representations, and is appropriate for storing binary data (for example, VARBINARY columns), or data that will be principally manipulated by custom MapReduce programs (reading from SequenceFiles is higher-performance than reading from text files, as records do not need to be parsed).
Avro data files are a compact, efficient binary format that provides interoperability with applications written in other programming languages. Avro also supports versioning, so that when, e.g., columns are added or removed from a table, previously imported data files can be processed along with new ones.
By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument. This applies to SequenceFile, text, and Avro files.
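For example, a sketch of enabling compression with an explicitly named codec (the table is hypothetical; the codec shown is the standard Hadoop gzip codec class):

$ sqoop import --connect jdbc:mysql://database.example.com/employees --table employees \
    --compress --compression-codec org.apache.hadoop.io.compress.GzipCodec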
Sqoop handles large objects (BLOB and CLOB columns) in particular ways. If this data is truly large, then these columns should not be fully materialized in memory for manipulation, as most columns are. Instead, their data is handled in a streaming fashion. Large objects can be stored inline with the rest of the data, in which case they are fully materialized in memory on every access, or they can be stored in a secondary storage file linked to the primary data storage. By default, large objects less than 16 MB in size are stored inline with the rest of the data. At a larger size, they are stored in files in the _lobs subdirectory of the import target directory. These files are stored in a separate format optimized for large record storage, which can accommodate records of up to 2^63 bytes each. The size at which lobs spill into separate files is controlled by the --inline-lob-limit argument, which takes a parameter specifying the largest lob size to keep inline, in bytes. If you set the inline LOB limit to 0, all large objects will be placed in external storage.
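A sketch of lowering the inline LOB threshold to 1 MB (1048576 bytes) so that larger objects spill into the _lobs subdirectory (the table name is hypothetical):

$ sqoop import --connect jdbc:mysql://database.example.com/media --table documents \
    --inline-lob-limit 1048576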
Table 5. Output line formatting arguments:

| Argument | Description |
|---|---|
| --enclosed-by <char> | Sets a required field enclosing character |
| --escaped-by <char> | Sets the escape character |
| --fields-terminated-by <char> | Sets the field separator character |
| --lines-terminated-by <char> | Sets the end-of-line character |
| --mysql-delimiters | Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: ' |
| --optionally-enclosed-by <char> | Sets a field enclosing character |
When importing to delimited files, the choice of delimiter is important. Delimiters which appear inside string-based fields may cause ambiguous parsing of the imported data by subsequent analysis passes. For example, the string "Hello, pleased to meet you" should not be imported with the end-of-field delimiter set to a comma.
Delimiters may be specified as:

- a character (--fields-terminated-by X)
- an escape character (--fields-terminated-by \t). Supported escape characters are:
  - \b (backspace)
  - \n (newline)
  - \r (carriage return)
  - \t (tab)
  - \" (double-quote)
  - \' (single-quote)
  - \\ (backslash)
  - \0 (NUL) - This will insert NUL characters between fields or lines, or will disable enclosing/escaping if used for one of the --enclosed-by, --optionally-enclosed-by, or --escaped-by arguments.
- \0ooo, where ooo is the octal value. For example, --fields-terminated-by \001 would yield the ^A character.
- \0xhhh, where hhh is the hex value. For example, --fields-terminated-by \0x10 would yield the carriage return character.
The default delimiters are a comma (,) for fields, a newline (\n) for records, no quote character, and no escape character. Note that this can lead to ambiguous/unparsable records if you import database records containing commas or newlines in the field data. For unambiguous parsing, both must be enabled. For example, via --mysql-delimiters.
If unambiguous delimiters cannot be presented, then use enclosing and escaping characters. The combination of (optional) enclosing and escaping characters will allow unambiguous parsing of lines. For example, suppose one column of a dataset contained the following values:

Some string, with a comma.
Another "string with quotes"

The following arguments would provide delimiters which can be unambiguously parsed:
$ sqoop import --fields-terminated-by , --escaped-by \\ --enclosed-by '\"' ...
(Note that to prevent the shell from mangling the enclosing character, we have enclosed that argument itself in single-quotes.)

The result of the above arguments applied to the above dataset would be:

"Some string, with a comma.","1","2","3"...
"Another \"string with quotes\"","4","5","6"...
Here the imported strings are shown in the context of additional columns ("1","2","3", etc.) to demonstrate the full effect of enclosing and escaping. The enclosing character is only strictly necessary when delimiter characters appear in the imported text. The enclosing character can therefore be specified as optional:
$ sqoop import --optionally-enclosed-by '\"' (the rest as above)...
Which would result in the following import:
"Some string, with a comma.",1,2,3... +"Another \"string with quotes\"",4,5,6...
Note: Even though Hive supports escaping characters, it does not handle escaping of the new-line character. Also, it does not support the notion of enclosing characters that may include field delimiters in the enclosed string. It is therefore recommended that you choose unambiguous field and record-terminating delimiters without the help of escaping and enclosing characters when working with Hive; this is due to limitations of Hive’s input parsing abilities.
The --mysql-delimiters argument is a shorthand argument which uses the default delimiters for the mysqldump program. If you use the mysqldump delimiters in conjunction with a direct-mode import (with --direct), very fast imports can be achieved.
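For example, a sketch combining the two (the connect string and table are hypothetical):

$ sqoop import --connect jdbc:mysql://database.example.com/corp --table employees \
    --direct --mysql-delimiters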
While the choice of delimiters is most important for a text-mode import, it is still relevant if you import to SequenceFiles with --as-sequencefile. The generated class' toString() method will use the delimiters you specify, so subsequent formatting of the output data will rely on the delimiters you choose.
Table 6. Input parsing arguments:

| Argument | Description |
|---|---|
| --input-enclosed-by <char> | Sets a required field encloser |
| --input-escaped-by <char> | Sets the input escape character |
| --input-fields-terminated-by <char> | Sets the input field separator |
| --input-lines-terminated-by <char> | Sets the input end-of-line character |
| --input-optionally-enclosed-by <char> | Sets a field enclosing character |
When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates when doing a delimited-format import. The delimiters are chosen with arguments such as --fields-terminated-by; this controls both how the data is written to disk, and how the generated parse() method reinterprets this data. The delimiters used by the parse() method can be chosen independently of the output arguments, by using --input-fields-terminated-by, and so on. This is useful, for example, to generate classes which can parse records created with one set of delimiters, and emit the records to a different set of files using a separate set of delimiters.
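As a sketch, the following generates a class whose parse() method expects tab-separated fields while the files written by the import use commas (the connect string and table are hypothetical):

$ sqoop import --connect jdbc:mysql://database.example.com/corp --table employees \
    --input-fields-terminated-by '\t' --fields-terminated-by ','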
Table 7. Hive arguments:

| Argument | Description |
|---|---|
| --hive-home <dir> | Override $HIVE_HOME |
| --hive-import | Import tables into Hive (Uses Hive’s default delimiters if none are set.) |
| --hive-overwrite | Overwrite existing data in the Hive table. |
| --create-hive-table | If set, then the job will fail if the target Hive table exists. By default this property is false. |
| --hive-table <table-name> | Sets the table name to use when importing to Hive. |
| --hive-drop-import-delims | Drops \n, \r, and \01 from string fields when importing to Hive. |
| --hive-delims-replacement | Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive. |
| --hive-partition-key | Name of the Hive field on which imported data is partitioned (sharded). |
| --hive-partition-value <v> | String value that serves as the partition key for data imported into Hive in this job. |
| --map-column-hive <map> | Override default mapping from SQL type to Hive type for configured columns. |
Sqoop’s import tool’s main function is to upload your data into files in HDFS. If you have a Hive metastore associated with your HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data’s layout in Hive. Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line.
If the Hive table already exists, you can specify the --hive-overwrite option to indicate that the existing table in Hive must be replaced. After your data is imported into HDFS or this step is omitted, Sqoop will generate a Hive script containing a CREATE TABLE operation defining your columns using Hive’s types, and a LOAD DATA INPATH statement to move the data files into Hive’s warehouse directory.
The script will be executed by calling the installed copy of hive on the machine where Sqoop is run. If you have multiple Hive installations, or hive is not in your $PATH, use the --hive-home option to identify the Hive installation directory. Sqoop will use $HIVE_HOME/bin/hive from here.
Note: This function is incompatible with --as-avrodatafile and --as-sequencefile.
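A sketch of a basic Hive import (the connect string and table are hypothetical):

$ sqoop import --connect jdbc:mysql://database.example.com/corp --table employees \
    --hive-import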
Even though Hive supports escaping characters, it does not handle escaping of the new-line character. Also, it does not support the notion of enclosing characters that may include field delimiters in the enclosed string. It is therefore recommended that you choose unambiguous field and record-terminating delimiters without the help of escaping and enclosing characters when working with Hive; this is due to limitations of Hive’s input parsing abilities. If you do use --escaped-by, --enclosed-by, or --optionally-enclosed-by when importing data into Hive, Sqoop will print a warning message.
Hive will have problems using Sqoop-imported data if your database’s rows contain string fields that have Hive’s default row delimiters (\n and \r characters) or column delimiters (\01 characters) present in them. You can use the --hive-drop-import-delims option to drop those characters on import to give Hive-compatible text data. Alternatively, you can use the --hive-delims-replacement option to replace those characters with a user-defined string on import to give Hive-compatible text data. These options should only be used if you use Hive’s default delimiters and should not be used if different delimiters
[... 1574 lines stripped ...]