hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15191) Add Private/Unstable BulkDelete operations to supporting object stores for DistCP
Date Fri, 02 Feb 2018 17:55:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350727#comment-16350727 ]

Steve Loughran commented on HADOOP-15191:

Aaron, thanks for the comments.

W.r.t. directories vs files: in a bulk S3 delete we can't check each path up front for being
a directory, so if you start deleting paths which aren't there, which refer to dirs, etc., things
get confused. The patch as is gets S3Guard into trouble if you hand it a directory on the

I'm currently thinking "do I need to do this at all?", based on those traces which show that
the file list for distcp includes all files under deleted directory trees. If we eliminate
that waste of effort, then we may not need this new API at all.

Good: no changes to filesystems, speedup everywhere
Danger: I'd need to build up a data structure in the distcp copy committer, one which, if it
goes OOM, breaks distcp workflows and leaves people who can phone me up unhappy.

I'm thinking of:
a binary tree of Path.hashCode() of all deleted directories; you look for the parent dir before
deleting a file; for a dir you then add yourself to the tree whether you are executed or

This avoids keeping all the Path structures around; it needs an object with a long and two pointers
per ref, is O(lg(directories)) on lookup/insert, and we could make the directory check combine
the lookup and the insert.
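
To make the idea above concrete, here is a minimal sketch, not the actual patch. `DeletedDirTracker`, its method names, and the 64-bit FNV-1a-style string hash are all hypothetical; paths are plain strings standing in for Hadoop's Path, and `TreeSet<Long>` stands in for the hand-rolled binary tree. The source sentence describing when a directory adds itself is cut off, so the sketch assumes a directory is always recorded, whether its own delete is issued or skipped.

```java
import java.util.TreeSet;

/** Hypothetical helper, not part of Hadoop: remembers deleted directories by hash only. */
class DeletedDirTracker {
    // Sorted tree of 64-bit hashes; no path strings are retained.
    private final TreeSet<Long> deletedDirHashes = new TreeSet<>();

    /** FNV-1a-style 64-bit hash over the path's chars (assumption: stands in for Path.hashCode()). */
    static long hash(String path) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < path.length(); i++) {
            h ^= path.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** Parent of an absolute, '/'-separated path; the parent of a top-level entry is "/". */
    static String parent(String path) {
        int slash = path.lastIndexOf('/');
        return slash <= 0 ? "/" : path.substring(0, slash);
    }

    /** True if the file's parent directory was already recorded, so its delete can be skipped. */
    public boolean shouldSkipFile(String file) {
        return deletedDirHashes.contains(hash(parent(file)));
    }

    /**
     * Combined lookup and insert for a directory: reports whether the parent
     * was already recorded, and always records this directory (assumption).
     */
    public boolean shouldSkipDir(String dir) {
        boolean covered = deletedDirHashes.contains(hash(parent(dir)));
        deletedDirHashes.add(hash(dir));
        return covered;
    }
}
```

This only works if parents are visited before their children, which the sorted listing should give. Because only hashes are kept, a collision could cause a delete to be skipped wrongly; a real implementation would have to decide whether that risk is acceptable.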

I'll file a separate JIRA on that; again, reviews appreciated. Let's see how far that one
can get before worrying about bulk deletion, which will only benefit the case of directories
retained but some/many/all files removed from them. That is a feature whose need will become more
apparent if the next patch logs information about files vs dirs deleted.

> Add Private/Unstable BulkDelete operations to supporting object stores for DistCP
> ---------------------------------------------------------------------------------
>                 Key: HADOOP-15191
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15191
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3, tools/distcp
>    Affects Versions: 2.9.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>         Attachments: HADOOP-15191-001.patch, HADOOP-15191-002.patch, HADOOP-15191-003.patch,
> Large scale DistCP with the -delete option doesn't finish in a viable time because of the final CopyCommitter doing a 1 by 1 delete of all missing files. This isn't randomized (the list is sorted), and it's throttled by AWS.
> If bulk deletion of files was exposed as an API, distCP would do 1/1000 of the REST calls, so not get throttled.
> Proposed: add an initially private/unstable interface for stores, {{BulkDelete}} which declares a page size and offers a {{bulkDelete(List<Path>)}} operation for the bulk

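As a rough illustration of the proposal quoted above, the following sketch shows how a caller might drive a paged bulk delete. It is not the interface from the attached patches: the method name `getBulkDeletePageSize`, the `BulkDeleteDriver` class, and the use of plain strings instead of Path are all assumptions made to keep the example self-contained.

```java
import java.util.List;

/** Hypothetical shape of the proposed interface; names other than bulkDelete are assumptions. */
interface BulkDelete {
    /** Maximum number of paths the store accepts in one bulk request. */
    int getBulkDeletePageSize();

    /** Delete all the given paths in a single store request. */
    void bulkDelete(List<String> paths);
}

/** Hypothetical caller-side helper that splits a delete list into pages. */
class BulkDeleteDriver {
    /** Issues one bulkDelete call per page and returns the number of calls made. */
    static int deleteInPages(BulkDelete store, List<String> paths) {
        int pageSize = store.getBulkDeletePageSize();
        int calls = 0;
        for (int start = 0; start < paths.size(); start += pageSize) {
            int end = Math.min(start + pageSize, paths.size());
            store.bulkDelete(paths.subList(start, end));
            calls++;
        }
        return calls;
    }
}
```

With a page size of 1000, deleting 2500 paths takes 3 requests instead of 2500, which is where the "1/1000 of the REST calls" estimate comes from.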