sqoop-user mailing list archives

From Michael Arena <mar...@paytronix.com>
Subject Re: Sqoop Incremental job does not honor the compression settings after the initial run
Date Tue, 05 May 2015 20:45:18 GMT
Sqoop 1.4.5-cdh5.2.1 on Cloudera CDH 5.2.1 cluster

From: Abraham Elmahrek
Reply-To: "user@sqoop.apache.org<mailto:user@sqoop.apache.org>"
Date: Tuesday, May 5, 2015 at 4:18 PM
To: "user@sqoop.apache.org<mailto:user@sqoop.apache.org>"
Subject: Re: Sqoop Incremental job does not honor the compression settings after the initial run

This seems like a bug. Which version of Sqoop are you using?

On Tue, May 5, 2015 at 12:50 PM, Michael Arena <marena@paytronix.com> wrote:
I am incrementally loading data from SQL Server to Hadoop using an Oozie Sqoop Action.
Oozie runs a saved job in the Sqoop Metastore as created below:

sqoop job \
   --create import__test__mydb__mytable \
   --meta-connect *** \
   -- import \
   --connect "jdbc:sqlserver://mydbserver:1433;databaseName=mydb;" \
   --username **** \
   --password-file **** \
   --num-mappers 4 \
   --target-dir /***/***/mytable \
   --fields-terminated-by '\t' --input-fields-terminated-by '\t' \
   --null-string '\\N' --null-non-string '\\N' \
   --input-null-string '\\N' --input-null-non-string '\\N' \
   --relaxed-isolation \
   --query "SELECT id, first_name, last_name, mod_time FROM mytable" \
   --split-by id \
   --merge-key id \
   --incremental lastmodified \
   --check-column mod_time \
   --last-value "1900-01-01 00:00:00.000" \
   --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec


The first time the job runs, it creates 4 files:
part-m-00000.snappy
part-m-00002.snappy
part-m-00003.snappy
part-m-00004.snappy

It did not need to do the "merge" step since there was no existing data.

However, on the next run it pulls the modified rows from SQL Server, "merges" them into the existing data, and creates these files:
part-r-00000
part-r-00001
part-r-00002
...
part-r-00020
part-r-00031

which are uncompressed TSV files.
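
(For reference, a quick way to verify this from the command line, with the target dir masked the same way as in the job definition above:)

% hadoop fs -ls /***/***/mytable
   (after the first run the part-m-* files carry the .snappy suffix; after the second run the part-r-* files have no suffix)
% hadoop fs -cat /***/***/mytable/part-r-00000 | head -1
   (prints readable tab-separated text, i.e. the reducer output is uncompressed)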


The Sqoop Metastore has the compression settings saved:
% sqoop job --show import__test__mydb__mytable
...
enable.compression = true
compression.codec = org.apache.hadoop.io.compress.SnappyCodec
...
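
(For what it's worth, the saved job can presumably also be executed by hand, outside Oozie, to rule out the Oozie action itself; masked values as above:)

% sqoop job --meta-connect *** --exec import__test__mydb__mytable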


Since the files from the first run are named "part-m-0000X.snappy", I am guessing that the "-m-" in the name means the mappers wrote them (consistent with the 4 mappers I specified).

On the second run, I am guessing that reducers (32 of them?) wrote the output, since a merge was necessary and the files have "-r-" in their names.

Is this a bug or expected behavior?
Is there some other setting that tells the reducers to honor the compression settings?
If it is a bug, where do I create an issue (JIRA) for it?
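
(A possible workaround, not verified: if the merge MapReduce job respects generic Hadoop options, passing the output-compression properties with -D might force the reducer side to compress. The property names below are the standard Hadoop 2 ones shipped with CDH 5; treat this as a sketch only.)

sqoop job \
   --create import__test__mydb__mytable \
   --meta-connect *** \
   -- import \
   -D mapreduce.output.fileoutputformat.compress=true \
   -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
   --connect "jdbc:sqlserver://mydbserver:1433;databaseName=mydb;" \
   ...remaining arguments unchanged from the job definition above...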

