sqoop-user mailing list archives

From Abraham Elmahrek <...@cloudera.com>
Subject Re: Sqoop Incremental job does not honor the compression settings after the initial run
Date Tue, 05 May 2015 20:18:34 GMT
This seems like a bug. Which version of Sqoop are you using?

On Tue, May 5, 2015 at 12:50 PM, Michael Arena <marena@paytronix.com> wrote:

>   I am incrementally loading data from SQL Server to Hadoop using an
> Oozie Sqoop Action.
> Oozie runs a saved job in the Sqoop Metastore as created below:
>
>  sqoop job \
>    --create import__test__mydb__mytable \
>    --meta-connect *** \
>    -- import \
>    --connect "jdbc:sqlserver://mydbserver:1433;databaseName=mydb;" \
>    --username **** \
>    --password-file **** \
>    --num-mappers 4 \
>    --target-dir /***/***/mytable \
>    --fields-terminated-by '\t' --input-fields-terminated-by '\t' \
>    --null-string '\\N' --null-non-string '\\N' \
>    --input-null-string '\\N' --input-null-non-string '\\N' \
>    --relaxed-isolation \
>    --query "SELECT id, first_name, last_name, mod_time FROM mytable" \
>    --split-by id \
>    --merge-key id \
>    --incremental lastmodified \
>    --check-column mod_time \
>    --last-value "1900-01-01 00:00:00.000" \
>    --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec
>
>
>  The first time the job runs, it creates 4 files like:
>  part-m-00000.snappy
>  part-m-00002.snappy
>  part-m-00003.snappy
>  part-m-00004.snappy
>
>  It did not need to do the "merge" step since there was no existing data.
>
>  However, the next time it runs, it pulls over modified rows from SQL
> Server and then "merges" them into the existing data and creates files:
>  part-r-00000
>  part-r-00001
>  part-r-00002
>  ...
>  part-r-00020
>  part-r-00031
>
>  which are uncompressed TSV files.
>
>
>  The Sqoop Metastore has the compression settings saved:
>  % sqoop job --show import__test__mydb__mytable
>  ...
>  enable.compression = true
>  compression.codec = org.apache.hadoop.io.compress.SnappyCodec
>  ...
>
>
>  Since the files are named "part-m-0000X.snappy" after the first run, I
> am guessing that the "-m-" in the name means the mappers created them (and
> also since I specified 4 mappers).
>
>  On the second run, I am guessing that the (32?) reducers created the
> output, since merging was necessary and the files have "-r-" in the name.
>
>  Is this a bug or expected behavior?
>  Are there other settings needed to tell the reducers to honor the
> compression settings?
>  If it is a bug, where do I create an issue (JIRA) for it?
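One thing that may be worth trying, assuming the merge step is an ordinary MapReduce job that reads the standard Hadoop output-compression properties from its configuration, is to set those properties explicitly as generic -D options. A minimal sketch, run as a one-off import with the same placeholders as the saved job above (whether the metastore would persist -D options in a saved job is a separate question):

  sqoop import \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    --connect "jdbc:sqlserver://mydbserver:1433;databaseName=mydb;" \
    ...remaining arguments exactly as in the saved job above...

On older Hadoop releases the equivalent properties are mapred.output.compress and mapred.output.compression.codec. If the part-r-* files from the merge then come out with a .snappy extension, the enable.compression / compression.codec values stored in the metastore are simply not being applied to the merge job's configuration.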
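To confirm which outputs are actually compressed (and to check the result of any re-run), a quick look from the shell, using the same placeholder path as the job above:

  hdfs dfs -ls /***/***/mytable                               # compressed outputs carry the .snappy extension
  hdfs dfs -cat /***/***/mytable/part-r-00000 | head          # readable TSV here means the reducer output is uncompressed
  hdfs dfs -text /***/***/mytable/part-m-00000.snappy | head  # -text decompresses using the codec inferred from the extension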
