hadoop-common-issues mailing list archives

From "Aaron T. Myers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel
Date Thu, 26 Jan 2017 01:40:27 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838988#comment-15838988 ]

Aaron T. Myers commented on HADOOP-11794:

Latest patch looks pretty good to me. Just a few small comments:

# "randomdize" -> "randomize": {{// When splitLargeFile is enabled, we don't randomdize
the copylist}}
# In two places you have essentially "if (LOG.isDebugEnabled()) { LOG.warn(...); }". You should
do {{LOG.debug(...)}} in these places, and perhaps also make these debug messages a little
more helpful than just "add1", which otherwise requires someone to read the source code to
understand what they mean.
# I think this option description is a little misleading:
+      new Option("chunksize", true, "Size of chunk in number of blocks when " +
+          "splitting large files into chunks to copy in parallel")),
Assuming I'm reading the code correctly, the way a file is determined to be "large" in this
context is just that it has more blocks than the configured chunk size. The current text also
seems to imply that there might be some other configuration option to enable/disable splitting
large files at all. I think better text would be something like "If set to a positive value,
files with more blocks than this value will be split at their block boundaries during transfer,
and reassembled on the destination cluster. By default, files will be transmitted in their
entirety without splitting."
# Rather than suppressing the checkstyle warnings, recommend implementing the builder pattern
for the {{CopyListingFileStatus}} constructors. That should make things quite a bit clearer.
# A handful of changed lines appear to be whitespace-only changes; not a big deal.
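To illustrate the builder suggestion in point 4: below is a hedged, self-contained sketch of what a builder for {{CopyListingFileStatus}} might look like. The class and field names here ({{chunkOffset}}, {{chunkLength}}) follow the patch's terminology, but the rest is hypothetical, not the actual Hadoop class:

```java
// Illustrative sketch only: a simplified stand-in for CopyListingFileStatus
// showing how a builder can replace a family of overloaded constructors.
public class CopyListingFileStatusBuilderSketch {

    static final class FileStatusSketch {
        final String path;
        final long length;
        final long chunkOffset;   // first block of this chunk (hypothetical field)
        final long chunkLength;   // number of blocks in this chunk (hypothetical field)

        private FileStatusSketch(Builder b) {
            this.path = b.path;
            this.length = b.length;
            this.chunkOffset = b.chunkOffset;
            this.chunkLength = b.chunkLength;
        }

        static final class Builder {
            private String path;
            private long length;
            // Defaults mean "the whole file", so callers that don't care
            // about chunking never mention these fields at all.
            private long chunkOffset = 0;
            private long chunkLength = Long.MAX_VALUE;

            Builder path(String path) { this.path = path; return this; }
            Builder length(long length) { this.length = length; return this; }
            Builder chunkOffset(long offset) { this.chunkOffset = offset; return this; }
            Builder chunkLength(long len) { this.chunkLength = len; return this; }

            FileStatusSketch build() { return new FileStatusSketch(this); }
        }
    }

    public static void main(String[] args) {
        // Callers set only the fields they care about; no N-argument
        // constructors, and no checkstyle suppressions needed.
        FileStatusSketch chunk = new FileStatusSketch.Builder()
            .path("/data/bigfile")
            .length(1L << 40)
            .chunkOffset(128)
            .chunkLength(64)
            .build();
        System.out.println(chunk.path + " " + chunk.chunkOffset + " " + chunk.chunkLength);
    }
}
```

The payoff is that each call site reads as named assignments rather than a positional argument list, which matters once chunk-related fields are added to an already wide constructor.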

> distcp can copy blocks in parallel
> ----------------------------------
>                 Key: HADOOP-11794
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11794
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 0.21.0
>            Reporter: dhruba borthakur
>            Assignee: Yongjun Zhang
>         Attachments: HADOOP-11794.001.patch, HADOOP-11794.002.patch, MAPREDUCE-2257.patch
> The minimum unit of work for a distcp task is a file. We have files that are greater
than 1 TB with a block size of 1 GB. If we use distcp to copy these files, the tasks either
take a very long time or eventually fail. A better approach would be for distcp to copy all
the source blocks in parallel, and then stitch the blocks back into files at the destination
via the HDFS concat API (HDFS-222).
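The "stitch the blocks back" step above maps onto {{FileSystem#concat}}. A minimal sketch under assumptions: the chunk paths below are hypothetical, the chunks are presumed already copied and block-aligned on the destination, and this only works on filesystems (like HDFS) that implement concat:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of reassembling chunk files into one file after a parallel copy.
// Requires an HDFS (or other concat-capable) FileSystem; paths are made up.
public class ConcatSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path target = new Path("/dest/bigfile.chunk0");
        Path[] rest = {
            new Path("/dest/bigfile.chunk1"),
            new Path("/dest/bigfile.chunk2")
        };
        // HDFS concat (HDFS-222) moves the blocks of `rest` onto `target`
        // as a metadata operation, without re-copying any data.
        fs.concat(target, rest);
        fs.rename(target, new Path("/dest/bigfile"));
    }
}
```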

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org
