hadoop-common-issues mailing list archives

From "Aaron Fabbri (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15489) S3Guard to self update on directory listings of S3
Date Fri, 25 May 2018 06:07:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490281#comment-16490281 ]

Aaron Fabbri commented on HADOOP-15489:

Look at S3Guard#dirListingUnion(). We used to always update the MetadataStore at the end of
listStatus(). Later we changed that so the update only happens when
fs.s3a.metadatastore.authoritative = true.

If you set this to true, the MetadataStore will always be updated at the end of listStatus(),
but keep in mind that short-circuit listings are not yet implemented for the DynamoDB
MetadataStore ([~gabor.bota] is working towards that, though).

There are other listing APIs that don't do this, of course.
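
To make the behaviour described above concrete, here is a minimal, self-contained sketch of a listing that merges what the store already knows with what S3 returned, and writes the result back only when an "authoritative" flag is set. It is a toy model, not the S3Guard code: the class ListingUnionSketch, its fields, and the plain string paths are all hypothetical, and the idea that the listing is a simple union is an assumption taken from the method name dirListingUnion().

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Toy model only: none of these types are the real S3Guard classes. */
class ListingUnionSketch {
    /** Hypothetical stand-in for a MetadataStore: directory -> known child names. */
    final Map<String, TreeSet<String>> store = new HashMap<>();

    /** Stand-in for the fs.s3a.metadatastore.authoritative setting. */
    final boolean authoritative;

    /** Counts write-backs so the effect of the flag is observable. */
    int storeWrites = 0;

    ListingUnionSketch(boolean authoritative) {
        this.authoritative = authoritative;
    }

    /** Merge the store's view with the S3 listing; write back only if authoritative. */
    List<String> listStatus(String dir, List<String> fromS3) {
        TreeSet<String> union = new TreeSet<>(store.getOrDefault(dir, new TreeSet<>()));
        union.addAll(fromS3);
        if (authoritative) {
            // The update at the end of the listing happens only in this mode.
            store.put(dir, new TreeSet<>(union));
            storeWrites++;
        }
        return new ArrayList<>(union);
    }
}
```

With the flag off, the caller still gets the merged listing but the store is left untouched, which is why a later per-file lookup cannot be answered from the store alone.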

> S3Guard to self update on directory listings of S3
> --------------------------------------------------
>                 Key: HADOOP-15489
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15489
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.1.0
>         Environment: s3guard
>            Reporter: Steve Loughran
>            Priority: Major
> S3Guard updates its table on a getFileStatus call, but not on a directory listing.
> While this makes directory listings faster (no need to push out an update), it slows
> down subsequent queries of the files, such as a sequence of:
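> {code}
> // recursive=false is an illustrative choice; the listing itself is what matters here
> RemoteIterator<LocatedFileStatus> statuses = s3a.listFiles(dir, false);
> while (statuses.hasNext()) {
>   LocatedFileStatus status = statuses.next();
>   if (status.isFile()) {
>     try (FSDataInputStream is = s3a.open(status.getPath())) {
>       // ... do something
>     }
>   }
> }
> {code}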
> This is because open() performs its own getFileStatus check, even after the listing.
> Updating the DDB tables after a listing would give those reads a speedup, albeit at the
> expense of initiating a (bulk) update in the list call. Of course, we could consider making
> that async, though that design (essentially a write-buffer) would require the buffer to be
> checked in the reads too.
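
The write-buffer idea in the quoted description can be sketched roughly as follows. This is only an illustration of why the read path has to consult the buffer: BufferedStoreSketch and its methods are hypothetical names, the "table" is an in-memory map standing in for the DDB table, and flush() is called directly here where a real design would run it on a background thread.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Toy write-buffer; the names here are hypothetical, not S3Guard APIs. */
class BufferedStoreSketch {
    /** Entries already persisted (stand-in for the DDB table): path -> file length. */
    private final Map<String, Long> table = new ConcurrentHashMap<>();

    /** Entries queued by a listing but not yet persisted. */
    private final Map<String, Long> pending = new ConcurrentHashMap<>();

    /** List path: queue the listed entries instead of writing them synchronously. */
    void enqueue(Map<String, Long> listed) {
        pending.putAll(listed);
    }

    /** Read path: the buffer has to be checked as well as the table. */
    Optional<Long> lookup(String path) {
        Long len = pending.get(path);
        if (len == null) {
            len = table.get(path);
        }
        return Optional.ofNullable(len);
    }

    /** Moves queued entries into the table; would run asynchronously in practice. */
    void flush() {
        for (Map.Entry<String, Long> e : pending.entrySet()) {
            table.put(e.getKey(), e.getValue());
            // Remove only if the entry was not replaced by a newer one meanwhile.
            pending.remove(e.getKey(), e.getValue());
        }
    }

    /** Number of entries still waiting to be flushed. */
    int pendingCount() {
        return pending.size();
    }
}
```

A lookup issued between enqueue() and flush() is answered from the pending map, which is the extra check on reads that the description points out.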
