sqoop-user mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Controlling compression during import
Date Mon, 05 Sep 2011 19:32:27 GMT

On Sep 5, 2011, at 12:12pm, Arvind Prabhakar wrote:

> On Sun, Sep 4, 2011 at 3:49 PM, Ken Krugler <kkrugler_lists@transpac.com> wrote:
>> Hi there,
>> The current documentation says:
>> By default, data is not compressed. You can compress your data by using the
>> deflate (gzip) algorithm with the -z or --compress argument, or specify any
>> Hadoop compression codec using the --compression-codec argument. This
>> applies to both SequenceFiles or text files.
>> But I think this is a bit misleading.
>> Currently if output compression is enabled in a cluster, then the Sqooped
>> data is always compressed, regardless of the setting of this flag.
>> It seems better to actually make compression controllable via --compress,
>> which means changing ImportJobBase.configureOutputFormat()
>>     if (options.shouldUseCompression()) {
>>       FileOutputFormat.setCompressOutput(job, true);
>>       FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
>>       SequenceFileOutputFormat.setOutputCompressionType(job,
>>           CompressionType.BLOCK);
>>     }
>>    // new stuff
>>     else {
>>       FileOutputFormat.setCompressOutput(job, false);
>>     }
>> Thoughts?
> This is a good point Ken. However, IMO it is better left as is since
> there may be a wider cluster management policy in effect that requires
> compression for all output files. One way to look at it is that for
> normal use, there is a predefined compression scheme configured
> cluster wide, and occasionally when required, Sqoop users can use a
> different scheme where necessary.

The problem is that when you use text files as Sqoop output, these get compressed at the file
level by (typically) deflate, gzip or lzo.

So you wind up with unsplittable files, which means that the degree of parallelism during
the next step of processing is constrained by the number of mappers used during sqooping.
But you typically set the number of mappers based on DB load & size of the data set.

And if partitioning isn't great, then you also wind up with heavily skewed sizes for these
unsplittable files, which makes things even worse.
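
To put rough numbers on that, here's a back-of-the-envelope sketch. It is not Hadoop code: the file sizes and the 128MB split size are made-up values for a hypothetical 4-mapper import, and the split count is a simple ceiling division rather than what FileInputFormat actually computes.

```java
import java.util.List;

public class SplitEstimate {
    // Rough number of input splits one file yields in the next job.
    // An unsplittable (e.g. gzip-compressed text) file is always a single split;
    // a splittable file is cut about every splitBytes bytes.
    // Simplification: empty files are ignored.
    static long splitsForFile(long fileBytes, long splitBytes, boolean splittable) {
        if (fileBytes <= 0) {
            return 0;
        }
        if (!splittable) {
            return 1;
        }
        return (fileBytes + splitBytes - 1) / splitBytes; // ceiling division
    }

    // Total splits (an upper bound on map-side parallelism) across all files.
    static long totalSplits(List<Long> fileSizes, long splitBytes, boolean splittable) {
        long total = 0;
        for (long size : fileSizes) {
            total += splitsForFile(size, splitBytes, splittable);
        }
        return total;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Hypothetical, badly skewed output of a 4-mapper import.
        List<Long> files = List.of(4096 * mb, 512 * mb, 256 * mb, 128 * mb);
        long splitBytes = 128 * mb; // assumed split size

        System.out.println("unsplittable: " + totalSplits(files, splitBytes, false)); // 4
        System.out.println("splittable:   " + totalSplits(files, splitBytes, true));  // 39
    }
}
```

With these assumed sizes the unsplittable case gives 4 map tasks, one of which reads 4GB on its own, versus roughly 39 evenly sized tasks if the same files could be split.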

The current workaround is to use binary or Avro output instead of text, but it's odd to
have to change output formats just to avoid the above problem.

If the argument is to avoid implicitly changing the cluster's default compression policy,
then I'd suggest supporting a -nocompression flag.
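
Something like the following three-way decision is what I have in mind. This is only a sketch: the class and method names are made up, the "no compression" option doesn't exist in Sqoop today, and the real change would live in ImportJobBase.configureOutputFormat() rather than in a standalone class.

```java
import java.util.Optional;

public class CompressionChoice {
    // Decide what to tell the job about output compression.
    //   Optional.of(true)  -> force compression on   (user passed --compress)
    //   Optional.of(false) -> force compression off  (hypothetical no-compression flag)
    //   Optional.empty()   -> leave the job untouched, so the cluster default applies
    static Optional<Boolean> resolve(boolean compressFlag, boolean noCompressionFlag) {
        if (compressFlag && noCompressionFlag) {
            throw new IllegalArgumentException(
                "compress and no-compression options are mutually exclusive");
        }
        if (compressFlag) {
            return Optional.of(true);
        }
        if (noCompressionFlag) {
            return Optional.of(false);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(resolve(true, false));
        System.out.println(resolve(false, true));
        System.out.println(resolve(false, false));
    }
}
```

Only the two explicit cases would end up calling FileOutputFormat.setCompressOutput(job, ...); with neither flag given, the cluster-wide policy stays in charge, which I think addresses the concern above.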


-- Ken

Ken Krugler
+1 530-210-6378
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
