hadoop-common-issues mailing list archives

From "Andrew Olson (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-16090) S3A Client to add explicit support for versioned stores
Date Fri, 27 Mar 2020 13:24:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068692#comment-17068692 ]

Andrew Olson commented on HADOOP-16090:
---------------------------------------

It seems there should be some general guidance somewhere about whether [Object Versioning|https://docs.aws.amazon.com/AmazonS3/latest/dev/ObjectVersioning.html]
is recommended for buckets used as S3A filesystems. I think the answer is "no, it is strongly
discouraged and may introduce unexpected performance issues or add operational complexity"
(e.g. accumulation of delete markers, and deleted files not reducing bucket storage size unless
a lifecycle policy that removes non-current versions is set up). Unless there are special
considerations (e.g. using versioning as a kind of replacement for "move to trash" functionality,
since "move to trash" performs much more slowly for S3A than for HDFS), it would generally seem
best not to enable versioning.
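
For buckets where versioning is enabled anyway, a minimal sketch of such a non-current-version
cleanup policy, using the AWS SDK for Java 1.x that S3A itself builds on (the bucket name is a
placeholder and the 7-day retention is an arbitrary choice, not a recommendation):

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration;
import com.amazonaws.services.s3.model.lifecycle.LifecycleFilter;

public class NoncurrentVersionCleanup {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Expire non-current object versions after 7 days (arbitrary choice),
    // and remove delete markers that no longer shadow any version.
    BucketLifecycleConfiguration.Rule rule = new BucketLifecycleConfiguration.Rule()
        .withId("expire-noncurrent-versions")
        .withFilter(new LifecycleFilter()) // empty filter = applies to the whole bucket
        .withNoncurrentVersionExpirationInDays(7)
        .withExpiredObjectDeleteMarker(true)
        .withStatus(BucketLifecycleConfiguration.ENABLED);

    // "my-s3a-bucket" is a hypothetical bucket name.
    s3.setBucketLifecycleConfiguration("my-s3a-bucket",
        new BucketLifecycleConfiguration().withRules(rule));
  }
}
{code}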

[https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html] is probably
the best place for such guidance. I don't see anywhere that it discusses this topic in depth,
with the various pros and cons evaluated.

> S3A Client to add explicit support for versioned stores
> -------------------------------------------------------
>
>                 Key: HADOOP-16090
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16090
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.8.1
>            Reporter: Dmitri Chmelev
>            Assignee: Steve Loughran
>            Priority: Minor
>
> The fix to avoid calls to getFileStatus() for each path component in deleteUnnecessaryFakeDirectories()
> (HADOOP-13164) results in accumulation of delete markers in versioned S3 buckets. That patch
> replaced the getFileStatus() checks with a single batch delete request formed by generating
> all ancestor keys of a given path. Since the delete request does not check for the existence
> of fake directories, it creates a delete marker for every path component that did not
> exist (or was previously deleted). Note that issuing a DELETE request without specifying
> a version ID will always create a new delete marker, even if one already exists ([AWS S3
> Developer Guide|https://docs.aws.amazon.com/AmazonS3/latest/dev/RemDelMarker.html]).
> Since deleteUnnecessaryFakeDirectories() is called as a callback on successful writes
> and on renames, delete markers accumulate rather quickly, and their rate of accumulation is
> inversely proportional to the depth of the path. In other words, directories closer to the
> root will have more delete markers than the leaves.
> This behavior negatively impacts the performance of the getFileStatus() operation when it has
> to issue a listObjects() request (especially v1), as the delete markers have to be examined
> while the request searches for the first current, non-deleted version of an object following
> a given prefix.
> I did a quick comparison against 3.x and the issue is still present: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L2947
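
To make the delete-marker accumulation described above concrete, here is a minimal sketch
(not from the patch itself) against a hypothetical versioned bucket, again using the AWS SDK
for Java 1.x: every versionless DELETE of a fake-directory-style key adds a fresh marker,
which a subsequent listVersions() call then has to surface.

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListVersionsRequest;
import com.amazonaws.services.s3.model.S3VersionSummary;
import com.amazonaws.services.s3.model.VersionListing;

public class DeleteMarkerDemo {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    String bucket = "my-versioned-bucket"; // placeholder name

    // Each DELETE without a version ID creates a brand-new delete marker,
    // even if the key never existed or is already "deleted".
    for (int i = 0; i < 3; i++) {
      s3.deleteObject(bucket, "a/b/"); // fake-directory-style key
    }

    // The markers are all still there: a version listing must return
    // (and getFileStatus()-style probes must skip past) every one of them
    // before a current object can be found under the prefix.
    VersionListing listing = s3.listVersions(
        new ListVersionsRequest().withBucketName(bucket).withPrefix("a/"));
    for (S3VersionSummary v : listing.getVersionSummaries()) {
      System.out.printf("%s deleteMarker=%b%n", v.getKey(), v.isDeleteMarker());
    }
  }
}
{code}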



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
