hadoop-common-issues mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-14137) Allow DistCp to take a file list within a src directory
Date Thu, 02 Mar 2017 02:52:45 GMT
Zheng Shao created HADOOP-14137:

             Summary: Allow DistCp to take a file list within a src directory
                 Key: HADOOP-14137
                 URL: https://issues.apache.org/jira/browse/HADOOP-14137
             Project: Hadoop Common
          Issue Type: New Feature
          Components: tools/distcp
            Reporter: Zheng Shao

DistCp is very slow to start when the src directory has a huge number of subdirectories.
In our case, we already have the directory listing (via "hdfs oiv -i fsimage" or via nightly
"hdfs dfs -ls -R /" dumps), and we would like to use that instead of doing a real-time listing
on the NameNode.

The "-f" option doesn't help in this case because it would try to put everything into a single
flat target directory.

We'd like to introduce a new option "-list <file>" for distcp.  The <file> contains
the result of listing the src directory.

In order to achieve this, we plan to:
1. Add a new CopyListing class, PregeneratedCopyListing, similar to SimpleCopyListing, which
does not recursively list ("-ls -R") the directory but takes the listing via "-list".
2. Add an option "-list <file>", which will automatically make distcp use the new PregeneratedCopyListing.