hadoop-common-issues mailing list archives

From "Nathan Roberts (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13114) DistCp should have option to compress data on write
Date Tue, 10 Jan 2017 15:31:58 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15815294#comment-15815294 ]

Nathan Roberts commented on HADOOP-13114:

Sorry for jumping in late. I tend to agree this seems like it might be outside the scope of
distcp. I understand the desire to support this capability but it seems like the use-cases
get strange if we fold it into distcp itself. It might be as simple as creating a new command:
"distcompress" or something similar, which could share exactly the same code-base as distcp
but only has this new capability in that mode. Some of the worries I have with having it in
distcp are:
- Just the name bothers me a bit. Copy commands don't normally transform data, but this one would.
- What happens if we run the command with compression twice? distcp a->b, then b->c?
I'm assuming c is a compressed version of b which is a compressed version of a. In order to
read we'd have to unwind both layers of compression. Seems strange and really easy to accidentally
have this happen.
- I'm assuming CRC checks have to be disabled when doing this. Did we force the user to disable
CRC checks by providing the necessary option, or did we just do it automatically? If automatic,
we should WARN them that this happened.
- Obvious question is: "if it's valuable to compress, why wasn't it compressed in the first
place?"
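
To make the double-compression and CRC concerns above concrete, here is a minimal sketch in Python. It is not DistCp code: the standard-library gzip module stands in for a Hadoop codec, and zlib.crc32 stands in for the file checksum (HDFS's actual checksum scheme is different). The variable names a, b and c simply mirror the a->b, b->c example.

```python
import gzip
import zlib

# "a": the original, uncompressed data on the source cluster.
a = b"example record\n" * 1000

# Copy with compression, a -> b (mtime=0 keeps the gzip header deterministic).
b = gzip.compress(a, mtime=0)

# Copy with compression again, b -> c: c is a compressed copy of compressed data.
c = gzip.compress(b, mtime=0)

# Reading c requires unwinding both layers.
one_layer = gzip.decompress(c)           # this is b, still compressed
two_layers = gzip.decompress(one_layer)  # this is a

# A checksum over the source bytes cannot match one over the compressed target,
# so a source-versus-target comparison is expected to fail after such a copy.
crc_a = zlib.crc32(a)
crc_b = zlib.crc32(b)
```

In this sketch nothing stops the second compression pass, which is the accidental case described above; any real option would need its own rule for inputs that are already compressed.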

> DistCp should have option to compress data on write
> ---------------------------------------------------
>                 Key: HADOOP-13114
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13114
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
>            Reporter: Suraj Nayak
>            Assignee: Suraj Nayak
>            Priority: Minor
>              Labels: distcp
>         Attachments: HADOOP-13114-trunk_2016-05-07-1.patch, HADOOP-13114-trunk_2016-05-08-1.patch,
HADOOP-13114-trunk_2016-05-10-1.patch, HADOOP-13114-trunk_2016-05-12-1.patch, HADOOP-13114.05.patch,
>   Original Estimate: 48h
>  Remaining Estimate: 48h
> The DistCp utility should have the capability to store data in a user-specified compression format.
This avoids one hop of compressing data after transfer. Backups to a different cluster
also benefit by saving one IO operation to and from HDFS, thus saving resources, time
and effort.
> * Create an option -compressOutput defaulting to {{org.apache.hadoop.io.compress.BZip2Codec}}.

> * Users will be able to change codec with {{-D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}
> * If distcp compression is enabled, suffix the filenames with the codec's default extension
to indicate that the file is compressed. Users can then tell which codec was used to compress
the data.
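
The proposed behaviour (a default codec, a user-selected codec, and a filename suffix) can be sketched as follows. This is an illustration in Python, not the attached patch: compress_on_write and the CODECS table are hypothetical names, and the standard-library bz2 and gzip modules stand in for the Hadoop codec classes. The ".bz2" and ".gz" suffixes are the default extensions of BZip2Codec and GzipCodec.

```python
import bz2
import gzip

# Hypothetical lookup: Hadoop codec class name -> (default extension, stand-in compressor).
CODECS = {
    "org.apache.hadoop.io.compress.BZip2Codec": (".bz2", bz2.compress),
    "org.apache.hadoop.io.compress.GzipCodec": (".gz", gzip.compress),
}

# The issue proposes BZip2Codec as the default when compression is enabled.
DEFAULT_CODEC = "org.apache.hadoop.io.compress.BZip2Codec"


def compress_on_write(name, data, codec=DEFAULT_CODEC):
    """Return (target file name, compressed bytes) for one copied file."""
    extension, compress = CODECS[codec]
    # Suffix the name so readers can tell which codec was used.
    return name + extension, compress(data)


payload = b"example record\n" * 100
default_name, default_blob = compress_on_write("part-00000", payload)
gzip_name, gzip_blob = compress_on_write(
    "part-00000", payload, codec="org.apache.hadoop.io.compress.GzipCodec"
)
```

With the default codec the target is named part-00000.bz2; selecting GzipCodec yields part-00000.gz.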

