hadoop-common-issues mailing list archives

From "Steve Loughran (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-16430) S3AFilesystem.delete to incrementally update s3guard with deletions
Date Fri, 23 Aug 2019 11:13:01 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-16430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914182#comment-16914182 ]

Steve Loughran commented on HADOOP-16430:
-----------------------------------------

Highlight: coding this has thrown up an issue with the current implementation:
the recursive delete only asks S3 for the files to remove, not DDB.

This ensures that an incomplete DDB store is not a problem, so deletion mostly recovers
from failures. However, it relies on the S3 LIST being complete, which we know is untrue
as listing is eventually consistent.

Proposed: after deleting all the files the LIST returned, we do a recursive list of what
is left in S3Guard and delete those entries too. That way, files we know about but which
the LIST missed will still get cleaned up.
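
A minimal sketch of that two-phase delete, with in-memory sets standing in for the S3 listing and the S3Guard/DDB view of the tree (the names here are illustrative, not the real S3A API):

```java
import java.util.*;

// Hypothetical sketch: phase 1 deletes what the (eventually consistent)
// S3 LIST returned; phase 2 lists what S3Guard still knows about under
// the path and deletes those entries too, catching anything LIST missed.
public class TwoPhaseDelete {
    static Set<String> deleteRecursive(Set<String> s3Listing,
                                       Set<String> s3guardListing) {
        Set<String> deleted = new HashSet<>();
        // Phase 1: delete everything the S3 LIST returned.
        deleted.addAll(s3Listing);
        // Phase 2: sweep the S3Guard listing for leftovers.
        for (String path : s3guardListing) {
            deleted.add(path);
        }
        return deleted;
    }

    public static void main(String[] args) {
        Set<String> s3 = new HashSet<>(Arrays.asList("/d/a", "/d/b"));
        // "/d/c" was missed by the LIST but is known to S3Guard.
        Set<String> guard = new HashSet<>(Arrays.asList("/d/a", "/d/b", "/d/c"));
        System.out.println(deleteRecursive(s3, guard).size()); // 3
    }
}
```

The point of the second phase is that it only ever widens the set of deletions, so an incomplete DDB store still can't cause entries to be skipped.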

> S3AFilesystem.delete to incrementally update s3guard with deletions
> -------------------------------------------------------------------
>
>                 Key: HADOOP-16430
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16430
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0, 3.3.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>         Attachments: Screenshot 2019-07-16 at 22.08.31.png
>
>
> Currently S3AFilesystem.delete() only updates S3Guard at the end of a paged delete
> operation. This makes it slow when there are many thousands of files to delete, and
> increases the window of vulnerability to failures.
> Preferred:
> * after every bulk DELETE call is issued to S3, queue the (async) delete of all entries
> in that POST.
> * at the end of the delete, await the completion of these operations.
> * inside S3AFS, also do the delete across threads, so that different HTTPS connections
> can be used.
> This should maximise DDB throughput against tables which aren't IO-limited.
> When executed against small, IOPS-limited tables, the parallel DDB DELETE batches will
> trigger a lot of throttling events; we should make sure these aren't going to trigger
> failures.
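
The per-page incremental update described in the bullets above could be sketched roughly as follows; an ExecutorService stands in for the S3AFS thread pool, the S3 and DDB calls themselves are elided, and all names are hypothetical rather than the real S3A implementation:

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch of incremental S3Guard updates: after each bulk
// S3 DELETE page is issued, the matching metadata deletions are queued
// asynchronously, and completion is awaited once all pages are done.
public class IncrementalDelete {
    static int deleteAllPages(List<List<String>> pages, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Integer>> pending = new ArrayList<>();
        for (List<String> page : pages) {
            // 1. issue the bulk DELETE to S3 for this page (elided).
            // 2. queue the async metadata delete for the same keys, so the
            //    store tracks the S3 state as the operation proceeds.
            pending.add(pool.submit(() -> page.size() /* entries removed */));
        }
        // 3. at the end of the delete, await completion of all updates.
        int removed = 0;
        for (Future<Integer> f : pending) {
            removed += f.get();
        }
        pool.shutdown();
        return removed;
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("/d/1", "/d/2"),
                Arrays.asList("/d/3"));
        System.out.println(deleteAllPages(pages, 2)); // 3
    }
}
```

Because each page's metadata update runs on its own thread, separate HTTPS connections can be used, which is where the extra DDB throughput would come from on tables that aren't IOPS-limited.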



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

