hadoop-common-issues mailing list archives

From "Andrew Olson (Jira)" <j...@apache.org>
Subject [jira] [Assigned] (HADOOP-16047) Avoid expensive rename when DistCp is writing to S3
Date Wed, 25 Mar 2020 19:12:00 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-16047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Olson reassigned HADOOP-16047:
-------------------------------------

    Assignee: Andrew Olson

> Avoid expensive rename when DistCp is writing to S3
> ---------------------------------------------------
>
>                 Key: HADOOP-16047
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16047
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3, tools/distcp
>            Reporter: Andrew Olson
>            Assignee: Andrew Olson
>            Priority: Major
>
> When writing to an S3-based target, the temp file and rename logic in RetriableFileCopyCommand
> adds unnecessary cost to the job, because the rename operation does a server-side copy + delete
> in S3 [1]. The renames are parallelized across all of the DistCp map tasks, so the severity
> is mitigated to some extent. However, a configuration property that conditionally allows distributed
> copies to avoid that expense and write directly to the target path would improve performance
> considerably.
> [1] https://github.com/apache/hadoop/blob/release-3.2.0-RC1/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/introduction.md#object-stores-vs-filesystems
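
The trade-off described above can be sketched as follows. This is a minimal illustration on a local filesystem, not the DistCp implementation: the function names, the `.distcp.tmp` suffix, and the `direct_write` flag are all hypothetical stand-ins (the issue does not name the proposed configuration property). Each function returns the list of storage operations it performed, so the extra rename step is visible.

```python
import os
import shutil
import tempfile


def copy_via_temp_and_rename(source, target):
    """Write to a temporary sibling file, then rename it onto the target."""
    tmp = target + ".distcp.tmp"  # hypothetical temp-file naming
    shutil.copyfile(source, tmp)
    # Cheap on a local filesystem; on S3 a rename is a server-side
    # copy followed by a delete, which is the cost the issue describes.
    os.replace(tmp, target)
    return ["write " + tmp, "rename " + tmp + " -> " + target]


def copy_direct(source, target):
    """Write straight to the target path, skipping the rename."""
    shutil.copyfile(source, target)
    return ["write " + target]


def copy_file(source, target, direct_write=False):
    """Pick a strategy from a flag (stand-in for the proposed property)."""
    if direct_write:
        return copy_direct(source, target)
    return copy_via_temp_and_rename(source, target)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "source.txt")
    with open(src, "w") as handle:
        handle.write("payload")

    # Default path: one write plus one rename.
    print(copy_file(src, os.path.join(workdir, "renamed.txt")))
    # Direct path: a single write.
    print(copy_file(src, os.path.join(workdir, "direct.txt"), direct_write=True))
```

Note that the direct variant gives up the property that a partially written file never appears at the target path, so a real option of this kind would presumably need to be opt-in.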



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

