hadoop-common-issues mailing list archives

From "Ravi Prakash (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8065) distcp should have an option to compress data while copying.
Date Tue, 03 May 2016 23:50:13 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15269857#comment-15269857 ]

Ravi Prakash commented on HADOOP-8065:

Thanks for the patch [~snayakm]! Here are some of my thoughts:

# What users seem to want is the ability to compress data *during transit*. {color:red}*This
patch does not enable compression of data during transit.*{color} Distcp is simply an MR job
whose maps read from a "source". If the source does not support compressing the data
before putting it on the network, I don't see how we could achieve what these users want.
# *We are simply enabling users to avoid a post-processing step to compress the data they
have already transferred*. This too is a noble goal if it makes the lives of users easier
IMHO. It also reduces the amount of space needed on the target filesystem. We should rewrite
the JIRA summary to be more explicit if that is the stated goal.
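
To illustrate the difference between the two points above, here is a minimal sketch using only JDK classes. It is not DistCp code: {{CompressOnWriteSketch}} and {{copyCompressed}} are hypothetical names, and gzip stands in for whatever codec the user would configure. The copier reads the full uncompressed stream from the source (so nothing is saved on the wire) and only the bytes written to the target are compressed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

/** Hypothetical illustration only; not part of DistCp. */
final class CompressOnWriteSketch {

    /**
     * Copies source to target, compressing on the write side only.
     * Returns the number of uncompressed bytes pulled from the source,
     * i.e. what would have crossed the network.
     */
    static long copyCompressed(InputStream source, OutputStream target) {
        long bytesRead = 0;
        byte[] buffer = new byte[8192];
        // Closing the gzip stream writes the trailer and closes the target.
        try (GZIPOutputStream out = new GZIPOutputStream(target)) {
            int n;
            while ((n = source.read(buffer)) != -1) {
                bytesRead += n;          // uncompressed bytes from the source
                out.write(buffer, 0, n); // compressed only as they are stored
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytesRead;
    }

    public static void main(String[] args) {
        byte[] raw = new byte[100_000]; // all zeros, so it compresses well
        ByteArrayOutputStream stored = new ByteArrayOutputStream();
        long transferred = copyCompressed(new ByteArrayInputStream(raw), stored);
        System.out.println("bytes read from source: " + transferred);
        System.out.println("bytes stored on target: " + stored.size());
    }
}
```

The amount read from the source equals the original file size regardless of the codec; only the stored size shrinks, which is the space saving on the target filesystem mentioned in the second point.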

Reviewing the patch:
# Do you really need the changes in {{CopyMapper}}?
# Nit: {{getCompressionCodcec}} is misspelt
# Instead of {code}      e.printStackTrace();
      LOG.error("Compression class " + compressionCodecClass
          + " not found in classpath");{code} you can simply pass {{e}} as a second argument
to the LOG.error method.
# With this patch, we'll end up creating an instance of a Codec for every file. Do you think
we could utilize something like {{org.apache.hadoop.io.compress.CodecPool}}?
# Perhaps we can add an option {{-compressOutput}} which defaults to some codec?
# Although it's conceivable that we may want to decompress before writing to the target filesystem,
we can punt that to another JIRA.
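
A rough sketch of two of the suggestions above (reusing compressors instead of creating one per file, and handing the exception to the logger), written against JDK classes only so that it runs standalone. Everything here is an assumption for illustration: {{DeflaterPoolSketch}}, {{borrow}}, {{giveBack}} and {{loadOrNull}} are hypothetical names, {{java.util.zip.Deflater}} stands in for a Hadoop {{Compressor}}, and {{java.util.logging}} stands in for the patch's logger. In the actual patch the equivalent calls would presumably be {{CodecPool.getCompressor(codec)}} / {{CodecPool.returnCompressor(compressor)}} and {{LOG.error(message, e)}}.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

/** Hypothetical stand-in for a codec pool; not Hadoop's CodecPool. */
final class DeflaterPoolSketch {
    private static final Logger LOG = Logger.getLogger("DeflaterPoolSketch");

    private final Deque<Deflater> idle = new ArrayDeque<>();
    private int created = 0;

    /** Hands out an idle Deflater, creating a new one only if none is idle. */
    synchronized Deflater borrow() {
        Deflater d = idle.pollFirst();
        if (d == null) {
            d = new Deflater();
            created++;
        }
        return d;
    }

    /** Resets the Deflater so the next file starts from a clean state. */
    synchronized void giveBack(Deflater d) {
        d.reset();
        idle.addFirst(d);
    }

    synchronized int createdCount() {
        return created;
    }

    /** Compresses one "file" with a pooled Deflater. */
    byte[] compress(byte[] data) {
        Deflater d = borrow();
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // A caller-supplied Deflater is finished, but not ended, on close.
        try (DeflaterOutputStream out = new DeflaterOutputStream(sink, d)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            giveBack(d);
        }
        return sink.toByteArray();
    }

    /** Resolves a class by name; the exception goes to the logger, not printStackTrace(). */
    static Class<?> loadOrNull(String className) {
        try {
            return Class.forName(className);
        } catch (ClassNotFoundException e) {
            LOG.log(Level.SEVERE,
                "Compression class " + className + " not found in classpath", e);
            return null;
        }
    }
}
```

Calling {{compress}} sequentially for several files on one pool instance allocates a single {{Deflater}}, which is the saving I have in mind for the per-file codec creation.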

Thanks for your efforts! :-)

> distcp should have an option to compress data while copying.
> ------------------------------------------------------------
>                 Key: HADOOP-8065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8065
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>    Affects Versions: 0.20.2
>            Reporter: Suresh Antony
>            Assignee: Suraj Nayak
>            Priority: Minor
>              Labels: distcp
>             Fix For: 0.20.2
>         Attachments: HADOOP-8065-trunk_2015-11-03.patch, HADOOP-8065-trunk_2015-11-04.patch,
HADOOP-8065-trunk_2016-04-29-4.patch, patch.distcp.2012-02-10
> We would like to compress the data while transferring it from our source system to the target system.
One way to do this is to write a map/reduce job that compresses the data before or after it is transferred.
This looks inefficient.
> Since distcp is already reading and writing the data, it would be better if it could do the compression while
doing so.
> The flip side is that the distcp -update option cannot check file size before copying
data. It can only check for the existence of the file.
> So I propose that if the -compress option is given, file size is not checked.
> Also, when we copy a file, an appropriate extension needs to be added to the file depending on compression
