hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13164) Optimize S3AFileSystem::deleteUnnecessaryFakeDirectories
Date Mon, 23 May 2016 10:17:12 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296193#comment-15296193 ]

Steve Loughran commented on HADOOP-13164:

The goal of the call is to eliminate upstream pseudo-directory blobs. I fear removing it would
do bad things.

But if it is called after every file is written, it will be expensive, especially as there
is a {{getFileStatus()}} in there (2 x {{getObjectMetadata()}} + 1 x {{listObjects()}}), plus the
{{deleteObjects()}} call. As this goes up the tree, the cost will be O(depth).

Given that a file has just been written, it is known that every parent directory has a child
(i.e. it is non-empty), so you don't need to check so much. You look for the existence of
a pseudo-directory blob at each parent path and, if it is there, delete it.

More deviously, you could say "delete the path without checking to see if it exists". If it's
not there, a failed delete is harmless. That'd still be O(depth), but one S3 call per level,
rather than 3 or 4.
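
A rough sketch of that no-check variant, to make the call count concrete. The {{ObjectStore}} interface and the in-memory store are hypothetical stand-ins, not the AWS SDK or the actual S3A code, and the sketch assumes pseudo-directory blobs are keys ending in "/":

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for an S3 client; not the AWS SDK. */
interface ObjectStore {
    /** Deletes the key if present; deleting an absent key is treated as a no-op. */
    void deleteObject(String key);
}

/** In-memory store that counts calls so the cost per file is visible. */
class InMemoryStore implements ObjectStore {
    final Set<String> keys = new TreeSet<>();
    int calls = 0;

    @Override
    public void deleteObject(String key) {
        calls++;
        keys.remove(key);
    }
}

class BlindDelete {
    /**
     * Issues one unconditional delete per ancestor marker key, e.g.
     * "a/b/" then "a/" for the file key "a/b/c.txt". No existence checks.
     */
    static void deleteParentMarkers(ObjectStore store, String fileKey) {
        int slash = fileKey.lastIndexOf('/');
        while (slash > 0) {
            store.deleteObject(fileKey.substring(0, slash + 1));
            slash = fileKey.lastIndexOf('/', slash - 1);
        }
    }
}
```

For a file two levels down this makes two delete calls and nothing else, whether or not the markers exist; the file itself is never touched because only keys ending in "/" are generated.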

And, once you go down that path, you could say "queue up a delete for all parent paths and
fire them off in one go", going from O(depth) to O(1). 
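
Something along these lines, perhaps. This is only a sketch: {{BulkStore}} is a hypothetical interface standing in for whatever bulk-delete call the client offers, and it again assumes the pseudo-directory blobs are keys ending in "/":

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a client with a multi-key delete; not the AWS SDK. */
interface BulkStore {
    /** Deletes every listed key in one request; absent keys are ignored. */
    void deleteObjects(List<String> keys);
}

class BatchedDelete {
    /** Returns the ancestor marker keys of a file key, deepest first. */
    static List<String> parentMarkerKeys(String fileKey) {
        List<String> keys = new ArrayList<>();
        int slash = fileKey.lastIndexOf('/');
        while (slash > 0) {
            keys.add(fileKey.substring(0, slash + 1));
            slash = fileKey.lastIndexOf('/', slash - 1);
        }
        return keys;
    }

    /** Queues all parent markers and sends them as a single bulk delete. */
    static void deleteParentMarkers(BulkStore store, String fileKey) {
        List<String> keys = parentMarkerKeys(fileKey);
        if (!keys.isEmpty()) {
            store.deleteObjects(keys); // one request, however deep the path
        }
    }
}
```

The string work is still O(depth), but the number of round trips to the store is one. A file in the bucket root produces no keys and so no request at all.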

Even better, you could maybe even do that asynchronously. I'd worry a bit there about race
conditions between the current thread and process, but given this is just a cleanup, it might
be safe, and I don't see it being any worse race-wise than what exists today, except now
it may be more visible to a single thread.
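
To illustrate the asynchronous idea, a minimal sketch using {{CompletableFuture}}. The {{AsyncCleanup}} class and the {{bulkDelete}} callback are hypothetical; which executor S3A would actually use, and how failures should be reported, are open questions this does not answer:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

class AsyncCleanup {
    /**
     * Hands the marker keys to the executor and returns immediately.
     * bulkDelete is a hypothetical callback that performs the actual delete.
     */
    static CompletableFuture<Void> submit(Executor executor,
                                          Consumer<List<String>> bulkDelete,
                                          List<String> markerKeys) {
        // Copy the list so later changes by the caller cannot alter the delete.
        List<String> snapshot = List.copyOf(markerKeys);
        return CompletableFuture
            .runAsync(() -> bulkDelete.accept(snapshot), executor)
            .exceptionally(e -> {
                // Assumption: a failed cleanup is non-fatal, so report and move on.
                System.err.println("marker cleanup failed: " + e);
                return null;
            });
    }
}
```

The caller can ignore the returned future or join on it, for example in tests. Nothing here addresses the race itself: another thread may see the markers for a short while after the write returns.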

That would need very, very careful testing. The one thing nobody wants is for an over-zealous
delete operation to lose data.

> Optimize S3AFileSystem::deleteUnnecessaryFakeDirectories
> --------------------------------------------------------
>                 Key: HADOOP-13164
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13164
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Rajesh Balamohan
>            Priority: Minor
> https://github.com/apache/hadoop/blob/27c4e90efce04e1b1302f668b5eb22412e00d033/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L1224
> deleteUnnecessaryFakeDirectories is invoked in S3AFileSystem during rename and on outputstream
> close() to purge any fake directories. Depending on the nesting in the folder structure, it
> might take a lot longer, as it invokes getFileStatus multiple times. Instead, it should
> be able to break out of the loop once a non-empty directory is encountered.

