hadoop-common-issues mailing list archives

From "Aaron Fabbri (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-16221) S3Guard: fail write that doesn't update metadata store
Date Wed, 01 May 2019 04:21:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-16221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830848#comment-16830848 ]

Aaron Fabbri commented on HADOOP-16221:

Sorry I didn't see this until now. Thanks for the contribution and documentation.

I'll give some background on the existing logic at least.

As you can see, we generally chose to fall back to raw S3 behavior when there are failures
in the MetadataStore. S3Guard was targeted at existing S3 customers, so that made sense
to me.

The MetadataStore is conceptually a "trailing log of metadata changes made to S3". You can
also think of it as a consistency hint. There are few guarantees in the semantics that
S3 exposes (e.g. no upper bound on eventual consistency time; think about what that means
for your write. You need a write journal with fast, scalable queries and transactions to really
solve this, but you'd be better off ditching S3 for a real storage system, IMO).

We are logging things that already happened in S3. As for error semantics: if you mutate S3
but fail to mutate the MetadataStore, I thought you should either (1) roll back the transaction
and return failure, or (2) skip the rollback and return success. #1 seemed too complex and slow
to do right on top of S3, while #2 returns success after a successful mutation of S3 state.

So this was an intentional decision: not returning failure when you successfully write a file
to S3. As you note, there is no rollback.
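To make the two options concrete, here is a rough sketch of the write paths being discussed. The names (Backend.putObject, MetaStore.recordInStore) are hypothetical, not the actual S3AFileSystem APIs; this only illustrates the error-propagation choice.

```java
import java.io.IOException;

public class WriteSketch {
    // Hypothetical stand-ins for S3 and the S3Guard MetadataStore.
    interface Backend { void putObject(String key) throws IOException; }
    interface MetaStore { void recordInStore(String key) throws IOException; }

    // Old behavior (option #2): the S3 write succeeded, so report success
    // even if the metadata-store update failed; only the consistency hint is lost.
    static boolean writeLenient(Backend s3, MetaStore store, String key)
            throws IOException {
        s3.putObject(key);                 // mutate S3 first
        try {
            store.recordInStore(key);      // best-effort consistency hint
        } catch (IOException e) {
            // swallowed: fall back to raw-S3 semantics
        }
        return true;
    }

    // New behavior: propagate the metadata failure to the caller,
    // even though the data was already written to S3 (no rollback).
    static boolean writeStrict(Backend s3, MetaStore store, String key)
            throws IOException {
        s3.putObject(key);
        store.recordInStore(key);          // an IOException here fails the whole write
        return true;
    }
}
```

Note that in both sketches the S3 mutation happens before the metadata update, which is why neither outcome is fully "right" when the second step fails.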

I can see the argument for doing it the new way as well. My bias is that it is important
to know whether or not you actually wrote data to the backing store. I spent some time in finance
(where the wrong write can cost you) and at storage companies, which formed that bias.

Essentially both options are "wrong". Before, we'd return success but give up consistency
hints on that path; now we return failure even though we wrote the data to S3.

In lieu of a real storage system, I think having this well-documented and configurable is
fine. The retries on the MetadataStore are pretty robust, so failures should be rare.
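Since the behavior is configurable, a core-site.xml entry along these lines is what I'd expect the patch to add. The property name and default here are my guess at the shape of the option, not a quote from the committed patch, so check the final docs:

```xml
<!-- Hypothetical sketch of the new S3A knob; verify the exact
     name and default against the committed HADOOP-16221 patch. -->
<property>
  <name>fs.s3a.metadatastore.fail.on.write.error</name>
  <value>true</value>
  <description>If true, fail the S3A write when the S3Guard metadata
    store update fails, instead of logging the error and returning
    success (the pre-existing behavior).</description>
</property>
```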

Hope this background was interesting. Feel free to email me if you ever need a review. My
email filters tend to catch a lot of stuff that I should have noticed.


> S3Guard: fail write that doesn't update metadata store
> ------------------------------------------------------
>                 Key: HADOOP-16221
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16221
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Ben Roling
>            Assignee: Ben Roling
>            Priority: Major
>             Fix For: 3.3.0
> Right now, a failure to write to the S3Guard metadata store (e.g. DynamoDB) is merely
logged. It does not fail the S3AFileSystem write operation itself. As such, the writer has no
idea that anything went wrong. The implication of this is that S3Guard doesn't always provide
the consistency it advertises.
> For example, [this article|https://blog.cloudera.com/blog/2017/08/introducing-s3guard-s3-consistency-for-apache-hadoop/] says:
> {quote}If a Hadoop S3A client creates or moves a file, and then a client lists its directory,
that file is now guaranteed to be included in the listing.
> {quote}
> Unfortunately, this is sort of untrue and could result in exactly the sort of problem
S3Guard is supposed to avoid:
> {quote}Missing data that is silently dropped. Multi-step Hadoop jobs that depend on output
of previous jobs may silently omit some data. This omission happens when a job chooses which
files to consume based on a directory listing, which may not include recently-written items.
> {quote}
> Imagine the typical multi-job Hadoop processing pipeline. Job 1 runs and succeeds, but
one (or more) S3Guard metadata write failed under the covers. Job 2 picks up the output directory
from Job 1 and runs its processing, potentially seeing an inconsistent listing, silently missing
some of the Job 1 output files.
> S3Guard should at least provide a configuration option to fail if the metadata write
fails. Ideally, it seems, this should even be the default?

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org
