hadoop-common-issues mailing list archives

From "Dmitri Chmelev (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-16090) deleteUnnecessaryFakeDirectories() creates unnecessary delete markers in a versioned S3 bucket
Date Fri, 01 Feb 2019 02:41:00 GMT
Dmitri Chmelev created HADOOP-16090:
---------------------------------------

             Summary: deleteUnnecessaryFakeDirectories() creates unnecessary delete markers in a versioned S3 bucket
                 Key: HADOOP-16090
                 URL: https://issues.apache.org/jira/browse/HADOOP-16090
             Project: Hadoop Common
          Issue Type: Bug
          Components: fs/s3
    Affects Versions: 2.8.1
            Reporter: Dmitri Chmelev


The fix that avoids calling getFileStatus() for each path component in deleteUnnecessaryFakeDirectories()
([HADOOP-13164|https://issues.apache.org/jira/browse/HADOOP-13164]) causes delete markers to
accumulate in versioned S3 buckets. That patch replaced the getFileStatus() checks with a single
batch delete request covering all ancestor keys generated from a given path. Because the delete
request does not check whether the fake directories exist, it creates a delete marker for every
path component that does not exist (or was previously deleted). Note that issuing a DELETE request
without specifying a version ID always creates a new delete marker, even if one already exists
([AWS S3 Developer Guide|https://docs.aws.amazon.com/AmazonS3/latest/dev/RemDelMarker.html]).
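To make the batch request concrete, here is a minimal sketch of how the ancestor keys for one object could be generated. It is not the S3AFileSystem code: the class name FakeDirKeys, the method ancestorKeys() and the sample key are hypothetical, and the trailing-slash form of fake directory keys is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class FakeDirKeys {
    /**
     * Returns the fake-directory key (with trailing '/') of every ancestor
     * of the given object key, deepest first. The bucket root is skipped.
     * Hypothetical helper for illustration only.
     */
    public static List<String> ancestorKeys(String objectKey) {
        List<String> keys = new ArrayList<>();
        int slash = objectKey.lastIndexOf('/');
        while (slash > 0) {
            // Everything up to and including this '/' names one ancestor.
            keys.add(objectKey.substring(0, slash + 1));
            slash = objectKey.lastIndexOf('/', slash - 1);
        }
        return keys;
    }

    public static void main(String[] args) {
        // Each key printed here would be part of the single batch delete,
        // whether or not an object with that key currently exists.
        System.out.println(ancestorKeys("data/2019/02/part-0000"));
    }
}
```

In a versioned bucket, each key in that list receives a new delete marker on every such request, since no version ID is supplied.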

Since deleteUnnecessaryFakeDirectories() is called as a callback on successful writes and
on renames, delete markers accumulate quickly, and the rate of accumulation decreases with
the depth of the path. In other words, directories closer to the root collect more delete
markers than those near the leaves.
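The skew towards the root can be illustrated with a small simulation. This sketch assumes that every successful write triggers exactly one unversioned DELETE per ancestor directory key, as described above; the class name DeleteMarkerCount, the method markersPerDirectory() and the sample keys are hypothetical, and no S3 calls are made.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DeleteMarkerCount {
    /**
     * Counts one delete marker per ancestor directory key for each written
     * object, modelling an unversioned DELETE issued on every write.
     */
    public static Map<String, Integer> markersPerDirectory(List<String> writtenKeys) {
        Map<String, Integer> markers = new TreeMap<>();
        for (String key : writtenKeys) {
            int slash = key.lastIndexOf('/');
            while (slash > 0) {
                markers.merge(key.substring(0, slash + 1), 1, Integer::sum);
                slash = key.lastIndexOf('/', slash - 1);
            }
        }
        return markers;
    }

    public static void main(String[] args) {
        List<String> writes = List.of(
            "logs/2019/01/a", "logs/2019/01/b", "logs/2019/02/c", "logs/2018/12/d");
        // Shared ancestors such as "logs/" are hit by every write beneath them.
        markersPerDirectory(writes).forEach((dir, n) -> System.out.println(dir + " -> " + n));
    }
}
```

With these four writes, logs/ is counted four times and logs/2019/ three times, while each leaf directory is counted only once per file written into it.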

This behavior degrades the performance of getFileStatus() when it has to issue a listObjects()
request (especially v1), because the delete markers have to be examined while the request searches
for the first current, non-deleted version of an object under a given prefix.

A quick comparison against 3.x shows the issue is still present: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L2947

 


