sqoop-user mailing list archives

From Abraham Elmahrek <...@cloudera.com>
Subject Re: Sqoop Incremental job does not honor the compression settings after the initial run
Date Tue, 05 May 2015 23:11:51 GMT
Yeah this is a problem. I've created
https://issues.apache.org/jira/browse/SQOOP-2346 to track this. Thanks for
letting the community know.

-Abe

On Tue, May 5, 2015 at 1:50 PM, Mauricio Aristizabal <
mauricio@impactradius.com> wrote:

> We've never been able to get sqoop-merge to use compression.  Even tried
> setting MR output properties.  I do hope you guys can figure out the issue,
> would save us a lot of space.
>
> On Tue, May 5, 2015 at 1:45 PM, Michael Arena <marena@paytronix.com>
> wrote:
>
>>   Sqoop 1.4.5-cdh5.2.1 on Cloudera CDH 5.2.1 cluster
>>
>>   From: Abraham Elmahrek
>> Reply-To: "user@sqoop.apache.org"
>> Date: Tuesday, May 5, 2015 at 4:18 PM
>> To: "user@sqoop.apache.org"
>> Subject: Re: Sqoop Incremental job does not honor the compression
>> settings after the initial run
>>
>>   This seems like a bug. Which version of Sqoop are you using?
>>
>> On Tue, May 5, 2015 at 12:50 PM, Michael Arena <marena@paytronix.com>
>> wrote:
>>
>>>   I am incrementally loading data from SQL Server to Hadoop using an
>>> Oozie Sqoop Action.
>>> Oozie runs a saved job in the Sqoop Metastore as created below:
>>>
>>>  sqoop job \
>>>    --create import__test__mydb__mytable \
>>>     --meta-connect *** \
>>>    -- import \
>>>     --connect "jdbc:sqlserver://mydbserver:1433;databaseName=mydb;" \
>>>    --username **** \
>>>    --password-file **** \
>>>    --num-mappers 4 \
>>>    --target-dir /***/***/mytable \
>>>    --fields-terminated-by '\t' --input-fields-terminated-by '\t' \
>>>    --null-string '\\N' --null-non-string '\\N' \
>>>    --input-null-string '\\N' --input-null-non-string '\\N' \
>>>    --relaxed-isolation \
>>>    --query "SELECT id, first_name, last_name, mod_time FROM mytable" \
>>>    --split-by id \
>>>    --merge-key id \
>>>    --incremental lastmodified \
>>>    --check-column mod_time \
>>>    --last-value "1900-01-01 00:00:00.000" \
>>>    --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec
>>>
>>>
>>>  The first time the job runs, it creates 4 files like:
>>>  part-m-00000.snappy
>>>  part-m-00001.snappy
>>>  part-m-00002.snappy
>>>  part-m-00003.snappy
>>>
>>>  It did not need to do the "merge" step since there was no existing
>>> data.
>>>
>>>  However, the next time it runs, it pulls over modified rows from SQL
>>> Server and then "merges" them into the existing data and creates files:
>>>  part-r-00000
>>>  part-r-00001
>>>  part-r-00002
>>>  ...
>>>  part-r-00020
>>>  part-r-00031
>>>
>>>  which are uncompressed TSV files.
>>>
>>>
>>>  The Sqoop Metastore has the compression settings saved:
>>>  % sqoop job --show import__test__mydb__mytable
>>>  ...
>>>  enable.compression = true
>>>  compression.codec = org.apache.hadoop.io.compress.SnappyCodec
>>>  ...
>>>
>>>
>>>  Since the files are named "part-m-0000X.snappy" after the first run, I
>>> am guessing that the "-m-" in the name means the mappers created them (and
>>> also since I specified 4 mappers).
>>>
>>>  On the second run, I am guessing that the (32?) reducers created the
>>> output since there was merging necessary and the files have "-r-" in the
>>> name.
>>>
>>>  Is this a bug or expected behavior?
>>>  Are there other settings that tell the reducers to honor the
>>> compression settings?
>>>  If it is a bug, where do I create an issue (JIRA) for it?
>>>
>>>
>>> How are you engaging with millennials at your organization? Earn
>>> “Lifetime Loyalty with Effective Millennial Engagement” by signing up for
>>> our next webinar. Join us *Tuesday, May 12 at 1:00 EDT *to obtain the
>>> tools you need to earn brand loyalty from this important demographic. Click
>>> here <http://content.paytronix.com/Lifetime-Loyalty_0515_sig.html>to
>>> register!
>>>
>>
>>
>
>
>
> --
>
> *Mauricio Aristizabal*
>
> Manager - Business Intelligence + Data Science | Impact Radius
>
> 10 East Figueroa Street, 2nd Floor | Santa Barbara, CA 93101
>
> m: +1 (323) 309-4260 | mauricio@impactradius.com
>
>
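
Archive note: the symptom reported in this thread can be checked from file names alone, since the first run produced part-m-*.snappy files and the merge run produced part-r-* files with no extension. The sketch below is a hypothetical helper, not part of Sqoop. It assumes that any suffix after the part number is a compression codec extension, and it reads names or full paths from standard input.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Sqoop: reads file names or paths on stdin,
# one per line, and reports which task type wrote each part file and whether
# the name carries an extension (assumed here to be a codec extension).
classify_part_files() {
  local path name kind state
  while IFS= read -r path; do
    name=${path##*/}                      # strip any leading directory
    case $name in
      part-m-*) kind=map ;;               # written by a mapper
      part-r-*) kind=reduce ;;            # written by a reducer (merge step)
      *) continue ;;                      # skip anything that is not a part file
    esac
    case $name in
      part-[mr]-*.*) state=compressed ;;  # e.g. .snappy
      *) state=uncompressed ;;
    esac
    printf '%s\t%s\t%s\n' "$name" "$kind" "$state"
  done
}

# File names taken from the report in this thread.
printf '%s\n' part-m-00000.snappy part-r-00000 | classify_part_files
```

To use it against a real target directory, feed it the path column of a directory listing, for example the last field of each line of `hdfs dfs -ls` output. Any line reported as "reduce" and "uncompressed" matches the behavior tracked in SQOOP-2346.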
