sqoop-user mailing list archives

From Kate Ting <k...@cloudera.com>
Subject Re: Controlling compression during import
Date Tue, 06 Sep 2011 13:58:42 GMT
Hi Ken, you make some good points, to which I've added comments individually.

re: the degree of parallelism during the next step of processing is
constrained by the number of mappers used during sqooping: does
https://issues.cloudera.org/browse/SQOOP-137 address it? If so, you
might want to add your comments there.

re: winding up with unsplittable files and heavily skewed sizes: you
can file separate JIRAs for those if desired.

re: partitioning isn't great: for some databases such as Oracle, the
problem of heavily skewed sizes can be overcome using row-ids; you can
file a JIRA for that if you feel it's needed.

Regards, Kate

On Mon, Sep 5, 2011 at 12:32 PM, Ken Krugler
<kkrugler_lists@transpac.com> wrote:
> On Sep 5, 2011, at 12:12pm, Arvind Prabhakar wrote:
>> On Sun, Sep 4, 2011 at 3:49 PM, Ken Krugler <kkrugler_lists@transpac.com> wrote:
>>> Hi there,
>>> The current documentation says:
>>> By default, data is not compressed. You can compress your data by using the
>>> deflate (gzip) algorithm with the -z or --compress argument, or specify any
>>> Hadoop compression codec using the --compression-codec argument. This
>>> applies to both SequenceFile and text files.
>>> But I think this is a bit misleading.
>>> Currently, if output compression is enabled in a cluster, then the Sqooped
>>> data is always compressed, regardless of the setting of this flag.
>>> It seems better to actually make compression controllable via --compress,
>>> which means changing ImportJobBase.configureOutputFormat()
>>>     if (options.shouldUseCompression()) {
>>>       FileOutputFormat.setCompressOutput(job, true);
>>>       FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
>>>       SequenceFileOutputFormat.setOutputCompressionType(job,
>>>           CompressionType.BLOCK);
>>>     } else {
>>>       // new: explicitly disable compression when --compress is not set
>>>       FileOutputFormat.setCompressOutput(job, false);
>>>     }
>>> Thoughts?
>> This is a good point, Ken. However, IMO it is better left as is, since
>> there may be a wider cluster management policy in effect that requires
>> compression for all output files. One way to look at it is that for
>> normal use there is a predefined compression scheme configured
>> cluster-wide, and Sqoop users can opt into a different scheme where
>> necessary.
> The problem is that when you use text files as Sqoop output, these get compressed
> at the file level by (typically) deflate, gzip, or lzo.
> So you wind up with unsplittable files, which means that the degree of parallelism
> during the next step of processing is constrained by the number of mappers used
> during sqooping. But you typically set the number of mappers based on DB load and
> the size of the data set.
> And if partitioning isn't great, then you also wind up with heavily skewed sizes
> for these unsplittable files, which makes things even worse.
> The current work-around is to use binary or Avro output instead of text, but
> that's an odd requirement just to avoid the above problem.
> If the argument is to avoid implicitly changing the cluster's default compression
> policy, then I'd suggest supporting a -nocompression flag.
> Regards,
> -- Ken
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
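
For anyone following along, the two documented ways of requesting compression
look like this (the connection string, table name, and target directory below
are hypothetical placeholders, not from this thread):

```shell
# Shorthand flag: compress output with the default deflate (gzip) codec.
sqoop import --connect jdbc:mysql://db.example.com/sales \
    --table orders --target-dir /data/orders -z

# Or name a specific Hadoop codec. Choosing a splittable codec such as
# bzip2 sidesteps the unsplittable-file problem discussed above, at the
# cost of slower compression.
sqoop import --connect jdbc:mysql://db.example.com/sales \
    --table orders --target-dir /data/orders \
    --compression-codec org.apache.hadoop.io.compress.BZip2Codec
```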
