hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15625) S3A input stream to use etags to detect changed source files
Date Mon, 18 Feb 2019 16:18:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771191#comment-16771191 ]

Steve Loughran commented on HADOOP-15625:
-----------------------------------------

Thanks, Ben. I've not had a chance to yet: I'm running regression tests related to
HADOOP-15843 in one window, backporting HADOOP-15281 to Hadoop 3.1+, backporting a lot of
ABFS changes to some other branch and, when I get a chance, doing my own coding (HADOOP-16068).

Please don't take this personally.

One thing I've been wondering is how third-party stores are going to handle that modified
header, and what we could do here. Ignoring the "this adds even more tests and documentation"
problem, I could imagine multiple options here for some fs.s3a.etag.checks setting:

* server: we do the check server-side
* client: we do the check on the client, which fails if the returned etag has changed. This
has to deal with stores which don't support the etag
* warn: simply downgrade the failure to a warning
* off: don't check
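
A minimal sketch of how those four values might be dispatched on the client. All names here
(EtagCheckSketch, Mode, parse, check, the s3a://bucket/key path in the usage note) are
hypothetical and are not part of S3A. It assumes "server" means the store itself rejects a
mismatched request, for example through a conditional GET, so the client has nothing to
compare in that mode, and it treats a missing etag as "cannot check" rather than as a failure.

```java
import java.io.IOException;
import java.util.Locale;

/** Hypothetical sketch only: none of these names exist in the S3A connector. */
final class EtagCheckSketch {

  /** Mirrors the four proposed values of the setting. */
  enum Mode { SERVER, CLIENT, WARN, OFF }

  /** Parse a configuration value such as "client"; unset or blank falls back. */
  static Mode parse(String value, Mode fallback) {
    if (value == null || value.trim().isEmpty()) {
      return fallback;
    }
    return Mode.valueOf(value.trim().toUpperCase(Locale.ROOT));
  }

  /**
   * Compare the etag cached earlier with the one on a later response.
   *
   * @return true if a mismatch was observed but tolerated (warn mode)
   * @throws IOException in client mode when the etags differ
   */
  static boolean check(Mode mode, String expected, String returned, String path)
      throws IOException {
    if (mode == Mode.OFF || mode == Mode.SERVER) {
      // Assumption: in server mode the store enforces the condition itself.
      return false;
    }
    if (expected == null || returned == null) {
      // The store did not supply an etag, so there is nothing to compare.
      return false;
    }
    if (expected.equals(returned)) {
      return false;
    }
    String message = "etag of " + path + " changed from " + expected + " to " + returned;
    if (mode == Mode.CLIENT) {
      throw new IOException(message);
    }
    // Warn mode: report the mismatch and carry on reading.
    System.err.println("WARN: " + message);
    return true;
  }
}
```

With this sketch, check(Mode.WARN, "a", "b", "s3a://bucket/key") logs and returns true, while
the same call with Mode.CLIENT throws an IOException.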

What do you think?

Oh, and we can add more metrics to the org.apache.hadoop.fs.s3a.S3AInstrumentation.InputStreamStatistics
class to count the number of times an inconsistency was observed. This could help with
monitoring/debugging across an entire cluster.
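
As a rough illustration of that counter, here is a standalone sketch. It does not use the real
InputStreamStatistics API; the class InputStreamStatsSketch and its method names are
hypothetical, and the merge step assumes per-stream counts are folded into a wider total when
a stream is closed.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a per-stream statistics object. */
final class InputStreamStatsSketch {

  /** Number of times a response carried an unexpected etag. */
  private final AtomicLong inconsistencies = new AtomicLong();

  /** Record one observed inconsistency. */
  void inconsistencyObserved() {
    inconsistencies.incrementAndGet();
  }

  /** Current count for this object. */
  long getInconsistencies() {
    return inconsistencies.get();
  }

  /** Fold another stream's count into this one, e.g. when that stream closes. */
  void merge(InputStreamStatsSketch other) {
    inconsistencies.addAndGet(other.getInconsistencies());
  }
}
```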


> S3A input stream to use etags to detect changed source files
> ------------------------------------------------------------
>
>                 Key: HADOOP-15625
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15625
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Brahma Reddy Battula
>            Assignee: Brahma Reddy Battula
>            Priority: Major
>         Attachments: HADOOP-15625-001.patch, HADOOP-15625-002.patch, HADOOP-15625-003.patch
>
>
> S3A input stream doesn't handle changing source files any better than the other cloud
> store connectors. Specifically: it doesn't notice the file has changed, caches the length
> from startup, and whenever a seek triggers a new GET, you may get one of: old data, new
> data, or perhaps even go from new data to old data due to eventual consistency.
> We can't do anything to stop this, but we could detect changes by
> # caching the etag of the first HEAD/GET (we don't get that HEAD on open with S3Guard, BTW)
> # on future GET requests, verify the etag of the response
> # raise an IOE if the remote file changed during the read.
> It's a more dramatic failure, but it stops changes silently corrupting things.
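
The three steps in the issue description above can be sketched as a small tracker that a
stream would consult after each HEAD/GET. This is an illustration under assumptions, not the
attached patch: the class EtagTrackerSketch and its onResponse method are hypothetical, and a
response without an etag is simply skipped.

```java
import java.io.IOException;

/** Hypothetical sketch: remembers the first etag seen and verifies later ones. */
final class EtagTrackerSketch {

  private final String path;

  /** Etag of the first response that carried one; null until then. */
  private String firstEtag;

  EtagTrackerSketch(String path) {
    this.path = path;
  }

  /**
   * Call with the etag of every HEAD/GET response for this file.
   *
   * @throws IOException if the etag differs from the first one cached
   */
  void onResponse(String etag) throws IOException {
    if (etag == null) {
      // No etag on this response, so a change cannot be detected here.
      return;
    }
    if (firstEtag == null) {
      // Step 1: cache the etag of the first HEAD/GET.
      firstEtag = etag;
      return;
    }
    // Step 2: verify the etag of each later response.
    if (!firstEtag.equals(etag)) {
      // Step 3: fail the read rather than return mixed old and new data.
      throw new IOException(
          "Remote file " + path + " changed during read: etag was "
              + firstEtag + ", now " + etag);
    }
  }
}
```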



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

