sqoop-user mailing list archives

From Mauricio Aristizabal <mauri...@impactradius.com>
Subject Re: Sqoop Incremental job does not honor the compression settings after the initial run
Date Tue, 05 May 2015 20:50:41 GMT
We've never been able to get sqoop-merge to use compression, even when setting the
MapReduce output properties directly.  I do hope you guys can figure out the issue;
it would save us a lot of space.
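
For reference, what we tried was something along these lines (the paths and record
class below are just placeholders, and the property names are the generic Hadoop 2.x
output-compression ones), and the merged output still came out uncompressed:

  sqoop merge \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    --new-data /data/mytable_incr \
    --onto /data/mytable \
    --target-dir /data/mytable_merged \
    --jar-file mytable.jar --class-name mytable \
    --merge-key id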

On Tue, May 5, 2015 at 1:45 PM, Michael Arena <marena@paytronix.com> wrote:

>   Sqoop 1.4.5-cdh5.2.1 on Cloudera CDH 5.2.1 cluster
>
>   From: Abraham Elmahrek
> Reply-To: "user@sqoop.apache.org"
> Date: Tuesday, May 5, 2015 at 4:18 PM
> To: "user@sqoop.apache.org"
> Subject: Re: Sqoop Incremental job does not honor the compression
> settings after the initial run
>
>   This seems like a bug. Which version of Sqoop are you using?
>
> On Tue, May 5, 2015 at 12:50 PM, Michael Arena <marena@paytronix.com>
> wrote:
>
>>   I am incrementally loading data from SQL Server to Hadoop using an
>> Oozie Sqoop Action.
>> Oozie runs a saved job in the Sqoop Metastore as created below:
>>
>>  sqoop job \
>>    --create import__test__mydb__mytable \
>>    --meta-connect *** \
>>    -- import \
>>    --connect "jdbc:sqlserver://mydbserver:1433;databaseName=mydb;" \
>>    --username **** \
>>    --password-file **** \
>>    --num-mappers 4 \
>>    --target-dir /***/***/mytable \
>>    --fields-terminated-by '\t' --input-fields-terminated-by '\t' \
>>    --null-string '\\N' --null-non-string '\\N' \
>>    --input-null-string '\\N' --input-null-non-string '\\N' \
>>    --relaxed-isolation \
>>    --query "SELECT id, first_name, last_name, mod_time FROM mytable" \
>>    --split-by id \
>>    --merge-key id \
>>    --incremental lastmodified \
>>    --check-column mod_time \
>>    --last-value "1900-01-01 00:00:00.000" \
>>    --compress \
>>    --compression-codec org.apache.hadoop.io.compress.SnappyCodec
>>
>>
>>  The first time the job runs, it creates 4 files like:
>>  part-m-00000.snappy
>>  part-m-00002.snappy
>>  part-m-00003.snappy
>>  part-m-00004.snappy
>>
>>  It did not need to do the "merge" step since there was no existing data.
>>
>>  However, the next time it runs, it pulls over modified rows from SQL
>> Server and then "merges" them into the existing data and creates files:
>>  part-r-00000
>>  part-r-00001
>>  part-r-00002
>>  ...
>>  part-r-00020
>>  part-r-00031
>>
>>  which are uncompressed TSV files.
>>
>>
>>  The Sqoop Metastore has the compression settings saved:
>>  % sqoop job --show import__test__mydb__mytable
>>  ...
>>  enable.compression = true
>> compression.codec = org.apache.hadoop.io.compress.SnappyCodec
>>  ...
>>
>>
>>  Since the files are named "part-m-0000X.snappy" after the first run, I am
>> guessing that the "-m-" in the name means the mappers created them (there are
>> also 4 of them, matching the 4 mappers I specified).
>>
>>  On the second run, I am guessing that the reducers (32 of them?) created the
>> output, since a merge was necessary and the files have "-r-" in the name.
>>
>>  Is this a bug or expected behavior?
>>  Is there some other setting to tell the reducers to honor the compression
>> settings?
>>  If it is a bug, where do I create a JIRA issue for it?
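>>
>>  I also don't know whether forcing the reducer-side compression explicitly would
>> make any difference; a rough, untested sketch of what I mean (the property names
>> are the Hadoop 2.x / MRv2 ones, passed as generic options right after the tool
>> name, with the rest of the arguments unchanged from the saved job above):
>>
>>  sqoop import \
>>    -D mapreduce.output.fileoutputformat.compress=true \
>>    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
>>    ... (same connection, incremental, and merge arguments as in the saved job above)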
>>
>>
>
>



-- 

*Mauricio Aristizabal*

Manager - Business Intelligence + Data Science | Impact Radius

10 East Figueroa Street, 2nd Floor | Santa Barbara, CA 93101

m: +1 (323) 309-4260 | mauricio@impactradius.com


