sqoop-user mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Controlling compression during import
Date Tue, 06 Sep 2011 14:07:43 GMT

On Sep 6, 2011, at 6:58am, Kate Ting wrote:

> Hi Ken, you make some good points, to which I've added comments individually.
> 
> re: the degree of parallelism during the next step of processing is
> constrained by the number of mappers used during sqooping: does
> https://issues.cloudera.org/browse/SQOOP-137 address it? If so, you
> might want to add your comments there.

Thanks for the ref, and yes that would help.

> re: winding up with unsplittable files and heavily skewed sizes: you
> can file separate JIRAs for those if desired.

That's not an issue for Sqoop - rather just how Hadoop works.
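
As a rough illustration of that Hadoop behaviour, here is a simplified sketch. It is not Hadoop's actual FileInputFormat logic: the class and method names are made up for this example, and it assumes the split size equals the block size and ignores empty files.

```java
/** Simplified model of how many input splits (and so map tasks) a set of files yields. */
final class SplitEstimate {
    private SplitEstimate() {}

    /**
     * @param fileSizes  sizes of the imported files, in bytes
     * @param blockSize  split size in bytes (assumed equal to the block size)
     * @param splittable false for whole-file compressed text such as gzip
     */
    static long countSplits(long[] fileSizes, long blockSize, boolean splittable) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        long splits = 0;
        for (long size : fileSizes) {
            if (size <= 0) {
                continue; // empty files are ignored in this model
            }
            // A splittable file is cut into block-sized pieces (ceiling division);
            // an unsplittable file is always read by a single mapper.
            splits += splittable ? (size + blockSize - 1) / blockSize : 1;
        }
        return splits;
    }
}
```

With four 1 GiB files and 128 MiB blocks, this model gives 32 splits for splittable files but only 4 for gzip-compressed text, that is, one per file written by the Sqoop mappers.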

> re: partitioning isn't great: for some databases such as Oracle, the
> problem of heavily skewed sizes can be overcome using row-ids, you can
> file a JIRA for that if you feel it's needed.

Again, not really a Sqoop issue. Things are fine with OraOop.

When we fall back to regular Sqoop, we don't have a good column to use for partitioning, so
the results wind up being heavily skewed. But I don't think there's anything Sqoop could do
to easily solve that problem.
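
To make the skew concrete, here is a minimal sketch. It assumes the [min, max] range of a numeric split column is cut into equal-width intervals, one per mapper, which is a simplification of Sqoop's default range splitting. RangeSkew and rowsPerMapper are hypothetical names, not Sqoop classes.

```java
/** Simplified model of equal-width range partitioning over a numeric split column. */
final class RangeSkew {
    private RangeSkew() {}

    /** Returns how many rows land in each mapper's key interval. */
    static int[] rowsPerMapper(long[] keys, int numMappers) {
        if (numMappers <= 0) {
            throw new IllegalArgumentException("numMappers must be positive");
        }
        int[] counts = new int[numMappers];
        if (keys.length == 0) {
            return counts;
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long k : keys) {
            min = Math.min(min, k);
            max = Math.max(max, k);
        }
        // Width of each mapper's interval; overflow for extreme key ranges is ignored here.
        double width = (double) (max - min + 1) / numMappers;
        for (long k : keys) {
            int bucket = (int) ((k - min) / width);
            if (bucket >= numMappers) {
                bucket = numMappers - 1; // guard against floating-point rounding at the top edge
            }
            counts[bucket]++;
        }
        return counts;
    }
}
```

For keys 1 through 9 plus a single outlier at 1000, split across 4 mappers, the counts come out as [9, 0, 0, 1]: one mapper does nearly all the work, and with compressed text output its one large file cannot be split afterwards.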

Regards,

-- Ken


> On Mon, Sep 5, 2011 at 12:32 PM, Ken Krugler
> <kkrugler_lists@transpac.com> wrote:
>> 
>> On Sep 5, 2011, at 12:12pm, Arvind Prabhakar wrote:
>> 
>>> On Sun, Sep 4, 2011 at 3:49 PM, Ken Krugler <kkrugler_lists@transpac.com> wrote:
>>>> Hi there,
>>>> The current documentation says:
>>>> 
>>>> By default, data is not compressed. You can compress your data by using the
>>>> deflate (gzip) algorithm with the -z or --compress argument, or specify any
>>>> Hadoop compression codec using the --compression-codec argument. This
>>>> applies to both SequenceFiles or text files.
>>>> 
>>>> But I think this is a bit misleading.
>>>> Currently, if output compression is enabled in a cluster, then the Sqooped
>>>> data is always compressed, regardless of the setting of this flag.
>>>> It seems better to actually make compression controllable via --compress,
>>>> which means changing ImportJobBase.configureOutputFormat():
>>>>     if (options.shouldUseCompression()) {
>>>>       FileOutputFormat.setCompressOutput(job, true);
>>>>       FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
>>>>       SequenceFileOutputFormat.setOutputCompressionType(job,
>>>>           CompressionType.BLOCK);
>>>>     } else {
>>>>       // new stuff: explicitly turn compression off when --compress is not given
>>>>       FileOutputFormat.setCompressOutput(job, false);
>>>>     }
>>>> Thoughts?
>>> 
>>> This is a good point Ken. However, IMO it is better left as is since
>>> there may be a wider cluster management policy in effect that requires
>>> compression for all output files. One way to look at it is that for
>>> normal use, there is a predefined compression scheme configured
>>> cluster wide, and occasionally when required, Sqoop users can use a
>>> different scheme where necessary.
>> 
>> The problem is that when you use text files as Sqoop output, these get
>> compressed at the file level by (typically) deflate, gzip or lzo.
>> 
>> So you wind up with unsplittable files, which means that the degree of
>> parallelism during the next step of processing is constrained by the
>> number of mappers used during sqooping. But you typically set the number
>> of mappers based on DB load & size of the data set.
>> 
>> And if partitioning isn't great, then you also wind up with heavily
>> skewed sizes for these unsplittable files, which makes things even worse.
>> 
>> The current work-around is to use binary or Avro output instead of text,
>> but that's an odd requirement to be able to avoid the above problem.
>> 
>> If the argument is to avoid implicitly changing the cluster's default
>> compression policy, then I'd suggest supporting a -nocompression flag.
>> 
>> Regards,
>> 
>> -- Ken
>> 
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> custom big data solutions & training
>> Hadoop, Cascading, Mahout & Solr