hadoop-common-issues mailing list archives

From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.
Date Wed, 18 May 2016 23:23:12 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15290079#comment-15290079 ]

Chris Nauroth commented on HADOOP-13145:

Interestingly, you're getting a much slower run than I am for S3A and a much faster run than
I am for WASB.  I'm in the US Pacific Northwest.  My S3 bucket is in us-west-2.  My Azure Storage
account is in West US.

Running org.apache.hadoop.fs.azure.contract.TestAzureNativeContractDistCp
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 140.389 sec - in org.apache.hadoop.fs.azure.contract.TestAzureNativeContractDistCp

Running org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 143.99 sec - in org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp

bq. Could it be made one of the scaleable tests where it takes a config of option on scale
so can be made configurable?

We definitely could do that, but in my test runs, the large file tests don't show a significantly
longer execution time.  (See below for my timings.)  Are the large file tests a long haul
in your environment?

Maybe a more effective change would be to cut down the number of test cases.  I could keep
just {{deepDirectoryStructureToRemote}}, {{largeFilesToRemote}}, {{deepDirectoryStructureFromRemote}}
and {{largeFilesFromRemote}}.  If I do that, then my S3A execution time comes down to 90 seconds.
I don't think it sacrifices much in terms of coverage.

Let me know your thoughts, and then I'll update the patch.

  <testcase name="multipleFilesToRemote" classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp"
  <testcase name="deepDirectoryStructureFromRemote" classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp"
  <testcase name="deepDirectoryStructureToRemote" classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp"
  <testcase name="largeFilesToRemote" classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp"
  <testcase name="singleFileToRemote" classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp"
  <testcase name="largeFilesFromRemote" classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp"
  <testcase name="multipleFilesFromRemote" classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp"
  <testcase name="singleFileFromRemote" classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp"

> In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.
> -------------------------------------------------------------------------------
>                 Key: HADOOP-13145
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13145
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>         Attachments: HADOOP-13145.001.patch, HADOOP-13145.003.patch
> After DistCp copies a file, it calls {{getFileStatus}} to get the {{FileStatus}} from
> the destination so that it can compare to the source and update metadata if necessary.  If
> the DistCp command was run without the option to preserve metadata attributes, then this additional
> {{getFileStatus}} call is wasteful.
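
To illustrate the idea in the issue description, here is a minimal sketch of guarding the post-copy status lookup.  It is not the actual patch and does not use the real DistCp or {{FileSystem}} classes: {{PreserveSketch}}, {{CountingStore}}, {{Attr}} and {{afterCopy}} are hypothetical names invented for this example, and the store only counts lookups instead of talking to a remote service.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-ins for illustration only; not the real DistCp classes.
final class PreserveSketch {

    /** Illustrative subset of attributes a copy might be asked to preserve. */
    enum Attr { USER, GROUP, PERMISSION }

    /** Stand-in for a destination store that counts status lookups. */
    static final class CountingStore {
        int getFileStatusCalls = 0;

        String getFileStatus(String path) {
            getFileStatusCalls++;          // each call would be a remote round trip
            return "status:" + path;
        }
    }

    /** Post-copy step: look up the destination status only if something must be preserved. */
    static void afterCopy(CountingStore dest, String path, Set<Attr> preserve) {
        if (preserve.isEmpty()) {
            return;                        // nothing to compare or update, so skip the lookup
        }
        String status = dest.getFileStatus(path);
        // Comparing against the source and updating metadata is omitted in this sketch.
    }

    public static void main(String[] args) {
        CountingStore store = new CountingStore();
        afterCopy(store, "/dst/a", EnumSet.noneOf(Attr.class));
        afterCopy(store, "/dst/b", EnumSet.of(Attr.USER, Attr.GROUP));
        System.out.println("getFileStatus calls: " + store.getFileStatusCalls);
    }
}
```

With an object store as the destination, each skipped lookup saves a remote request per copied file, which is the saving the issue is after.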

