hadoop-common-issues mailing list archives

From "Sean Mackrory (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-16085) S3Guard: use object version or etags to protect against inconsistent read after replace/overwrite
Date Wed, 17 Apr 2019 15:58:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-16085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820236#comment-16820236 ]

Sean Mackrory commented on HADOOP-16085:
----------------------------------------

Left some feedback inline on the pull request (and on HADOOP-16221 too). Some more general
thoughts:
* Have been discussing with [~stevel@apache.org] whether the FileStatus -> S3AFileStatus
and schema changes should be separated out from the enforcement. I think the best argument
for separating them is that getting older clients to notify newer clients of changes is a
smaller change than enforcement, which only the newer clients would do. The other factor
mentioned is the desire to keep S3Guard relatively storage-agnostic, but I honestly just
don't see how we can do that and still have a robust solution. S3 is popular enough to
warrant a custom solution that really does fix all the holes. Personally, I think we should
just keep this change together.
* I don't suppose there's an interface we can rely on to provide getETag() and getVersionId(),
is there? This is where Go's duck typing would be nice, so we could eliminate 2 (or more) of
the args to every constructor call. Not a big deal. I have a small to-do list of other little
things to look into, but as you'll see on the PR, the overwhelming majority of my feedback
is pretty mechanical. Overall, I think this is looking like a good, solid patch.
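
To illustrate the second point: Java has no structural typing, so one possible workaround
is a small interface plus an explicit adapter. The sketch below is only an illustration.
ObjectAttributes, FakePutResult, SketchFileStatus and AttributesSketch are hypothetical
names invented here, not types from Hadoop or the AWS SDK, and the patch may well do
something different.

```java
// Hypothetical sketch: none of these types exist in Hadoop or the AWS SDK.

/** Anything that can report an object's ETag and version ID. */
interface ObjectAttributes {
    String getETag();
    String getVersionId(); // assumed to be null when no version ID is available
}

/** Stand-in for an SDK result type that happens to expose both getters. */
final class FakePutResult {
    private final String eTag;
    private final String versionId;

    FakePutResult(String eTag, String versionId) {
        this.eTag = eTag;
        this.versionId = versionId;
    }

    String getETag() { return eTag; }
    String getVersionId() { return versionId; }
}

/** Stand-in for a file status: one attributes argument replaces two strings. */
final class SketchFileStatus {
    private final String path;
    private final long length;
    private final ObjectAttributes attributes;

    SketchFileStatus(String path, long length, ObjectAttributes attributes) {
        this.path = path;
        this.length = length;
        this.attributes = attributes;
    }

    String getPath() { return path; }
    long getLength() { return length; }
    String getETag() { return attributes.getETag(); }
    String getVersionId() { return attributes.getVersionId(); }
}

final class AttributesSketch {
    /** Explicit adapter; in Go, FakePutResult would satisfy the interface implicitly. */
    static ObjectAttributes adapt(final FakePutResult result) {
        return new ObjectAttributes() {
            @Override public String getETag() { return result.getETag(); }
            @Override public String getVersionId() { return result.getVersionId(); }
        };
    }

    public static void main(String[] args) {
        FakePutResult put = new FakePutResult("abc123", "v1");
        SketchFileStatus status = new SketchFileStatus("s3a://bucket/key", 42L, adapt(put));
        System.out.println(status.getETag() + " " + status.getVersionId());
    }
}
```

The adapter is the boilerplate that duck typing would remove; whether it is worth the
extra type just to drop two constructor arguments is a judgment call.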

> S3Guard: use object version or etags to protect against inconsistent read after replace/overwrite
> -------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-16085
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16085
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Ben Roling
>            Assignee: Ben Roling
>            Priority: Major
>         Attachments: HADOOP-16085-003.patch, HADOOP-16085_002.patch, HADOOP-16085_3.2.0_001.patch
>
>
> Currently S3Guard doesn't track S3 object versions.  If a file is written in S3A with
> S3Guard and then subsequently overwritten, there is no protection against the next reader
> seeing the old version of the file instead of the new one.
> It seems like the S3Guard metadata could track the S3 object version.  When a file is
> created or updated, the object version could be written to the S3Guard metadata.  When a
> file is read, the read out of S3 could be performed by object version, ensuring the correct
> version is retrieved.
> I don't have a lot of direct experience with this yet, but this is my impression from
> looking through the code.  My organization is looking to shift some datasets stored in HDFS
> over to S3 and is concerned about this potential issue as there are some cases in our codebase
> that would do an overwrite.
> I imagine this idea may have been considered before but I couldn't quite track down any
> JIRAs discussing it.  If there is one, feel free to close this with a reference to it.
> Am I understanding things correctly?  Is this idea feasible?  Any feedback that could
> be provided would be appreciated.  We may consider crafting a patch.
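
The idea in the issue description (record the object version on write, then read by that
version) can be sketched with a toy in-memory model. This is not S3A or S3Guard code:
ToyVersionedStore, ToyMetadataStore and VersionedReadSketch are hypothetical names, the
"v1", "v2" version IDs are invented, and the stale read is simulated by always returning
the oldest version.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a versioned bucket; this is not the S3 API. */
final class ToyVersionedStore {
    private final Map<String, List<String>> versions = new HashMap<>();

    /** Stores a new version of the key and returns its toy version ID. */
    String put(String key, String data) {
        List<String> list = versions.computeIfAbsent(key, k -> new ArrayList<>());
        list.add(data);
        return "v" + list.size();
    }

    /** Simulates an inconsistent read by returning the oldest version. */
    String getPossiblyStale(String key) {
        return versions.get(key).get(0);
    }

    /** Read pinned to one specific version. */
    String getVersion(String key, String versionId) {
        int index = Integer.parseInt(versionId.substring(1)) - 1;
        return versions.get(key).get(index);
    }
}

/** Toy metadata table remembering the version ID of the last write per key. */
final class ToyMetadataStore {
    private final Map<String, String> versionByKey = new HashMap<>();

    void record(String key, String versionId) { versionByKey.put(key, versionId); }
    String lookup(String key) { return versionByKey.get(key); }
}

final class VersionedReadSketch {
    private final ToyVersionedStore store = new ToyVersionedStore();
    private final ToyMetadataStore metadata = new ToyMetadataStore();

    /** Write the data, then record the resulting version ID in the metadata. */
    void write(String key, String data) {
        metadata.record(key, store.put(key, data));
    }

    /** Read by the recorded version ID so an older version cannot be returned. */
    String read(String key) {
        String versionId = metadata.lookup(key);
        if (versionId == null) {
            throw new IllegalStateException("no metadata recorded for " + key);
        }
        return store.getVersion(key, versionId);
    }

    /** Read without consulting the metadata; may observe the old version. */
    String readUnguarded(String key) {
        return store.getPossiblyStale(key);
    }

    public static void main(String[] args) {
        VersionedReadSketch fs = new VersionedReadSketch();
        fs.write("data/part-0000", "old contents");
        fs.write("data/part-0000", "new contents");
        System.out.println("guarded:   " + fs.read("data/part-0000"));
        System.out.println("unguarded: " + fs.readUnguarded("data/part-0000"));
    }
}
```

The sketch covers only the version-ID path and assumes every write yields a version ID;
the ETag alternative named in the issue title is not modeled here.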



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

