flink-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-5706) Implement Flink's own S3 filesystem
Date Sat, 04 Mar 2017 12:20:45 GMT

    [ https://issues.apache.org/jira/browse/FLINK-5706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895648#comment-15895648 ]

Steve Loughran commented on FLINK-5706:

Stephan, I don't think you appreciate how hard it is to do this.

I will draw your attention to all the features coming in Hadoop 2.8, HADOOP-11694, including
seek-optimised input streams, disk/heap/byte-buffer block uploads, support for encryption,
and optimisation of every request, HTTPS call by HTTPS call.
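To make those concrete: a minimal sketch (bucket name hypothetical) of switching the 2.8
features on through the s3a configuration keys, using the Hadoop Java API:

{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3ATuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // seek-optimised input: "random" for columnar IO, "sequential" for scans
    conf.set("fs.s3a.experimental.input.fadvise", "random");
    // incremental block upload, buffered to disk, heap or byte buffers
    conf.setBoolean("fs.s3a.fast.upload", true);
    conf.set("fs.s3a.fast.upload.buffer", "disk");  // or "array" / "bytebuffer"
    // server-side encryption
    conf.set("fs.s3a.server-side-encryption-algorithm", "AES256");

    FileSystem fs = FileSystem.get(new URI("s3a://some-bucket/"), conf);
    System.out.println(fs.getFileStatus(new Path("/")));
  }
}
{code}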

Then there's the todo list for later, HADOOP-13204. One aspect of this is HADOOP-13345,
S3Guard, which uses DynamoDB for that consistent view of metadata; another is HADOOP-13786,
something to commit work directly to S3 which supports speculation and fault tolerance.

These are all the things you get to replicate, along with the scale tests, which do find things:
HADOOP-14028 only showed up on 70GB writes. Then there are the various intermittent failures
you don't see often but which cause serious problems when they do. Example: the final POST
of a multipart PUT doesn't do retries; you have to do that yourself. After you find the problem.
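To make that concrete, a sketch only (the helper and its retry policy are mine, not the s3a
code) of what "do that yourself" means with the AWS Java SDK:

{code:java}
import com.amazonaws.AmazonClientException;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.CompleteMultipartUploadResult;

public final class MultipartCompleter {
  // retry the final POST ourselves, since the SDK will not do it for us
  static CompleteMultipartUploadResult completeWithRetries(
      AmazonS3 s3, CompleteMultipartUploadRequest request, int attempts)
      throws InterruptedException {
    AmazonClientException last = null;
    for (int i = 0; i < attempts; i++) {
      try {
        return s3.completeMultipartUpload(request);  // the final POST
      } catch (AmazonClientException e) {
        last = e;                   // treat as transient; back off and retry
        Thread.sleep(1000L << i);   // simple exponential backoff
      }
    }
    throw last;                     // all attempts failed
  }
}
{code}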

As a sibling project, you are free to lift the entirety of the s3a code, along with all the
tests it includes. But you then take on the burden of maintaining it, fielding support calls,
and doing your entire integration test work yourself, plus performance tuning. Did [I mention testing?|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md]
We have put a lot of effort in there. You have to balance remote test runs with scalable performance
and affordability, where reusing Amazon's own datasets is the secret, along with our own TPC-DS
datasets, and running tests in-EC2 to really stress things.

This is not trivial, even if you lift the latest branch-2 code. We are still finding bugs
there, ones that surface in the field. We may have a broader set of downstream things to deal
with: distcp, mapreduce, hive, spark, even flink, but that helps us with the test reports
(we keep an eye on JIRAs and Stack Overflow for the word "s3a") and with the different
deployments people run.

Please, do not take this on lightly.

Returning to your example above:
# it's not just that the {{exists()/HEAD}} probe can take time to respond, it is that the
directory listing lags the direct {{HEAD object}} call; even if the exists() check returns
404, a LIST operation may still list the entry. And because the front-end load balancers cache
things, while the code deleting the object may get a 404 indicating that the object is gone,
*there is no guarantee that a different caller will not get a 200* (sketched in the first
example after this list).
# You may even get it in the same process, though if your calls are using a thread pool of
keep-alive HTTP/1.1 connections and all calls are on the same TCP connection, you'll be hitting
the same load balancer and so get the cached 404. Because yes, load balancers cache 404 entries,
meaning you don't even get create consistency if you do a check first.
# S3 doesn't have read-after-write consistency in general. It now has create consistency across
all regions (yes, for a long time it had different behaviours on US-East vs the others),
provided you don't do a HEAD first.
# You don't get PUT-over-PUT consistency or DELETE consistency, and metadata queries invariably
lag the object state, even on create.
# there is no such thing as {{rename()}}, merely a COPY at approximately 6-10MB/s, which makes
the operation O(data) and non-atomic (sketched in the second example after this list).
# if you are copying atop objects with the same name, you hit update consistency, for which
there are no guarantees. Again, different callers may see different results, irrespective
of call ordering, and listing will lag creation.
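To make points 1-3 concrete, a sketch (hypothetical bucket and key, AWS Java SDK) of the
HEAD-before-PUT trap:

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.AmazonS3Exception;

public class CreateConsistencyDemo {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    String bucket = "some-bucket", key = "work/_SUCCESS";
    try {
      s3.getObjectMetadata(bucket, key);      // HEAD; expect a 404
    } catch (AmazonS3Exception e) {
      if (e.getStatusCode() != 404) throw e;  // a load balancer may now cache that 404
    }
    s3.putObject(bucket, key, "done");        // PUT the object
    // a later HEAD -- from this caller or another -- can still return the
    // cached 404: create consistency only holds if no HEAD preceded the PUT
    try {
      s3.getObjectMetadata(bucket, key);
      System.out.println("object visible");
    } catch (AmazonS3Exception e) {
      System.out.println("still 404 after PUT");
    }
  }
}
{code}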
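And for point 5, what a "rename" actually is under the covers (again a sketch; the helper is
hypothetical):

{code:java}
import com.amazonaws.services.s3.AmazonS3;

public final class FakeRename {
  // "rename" on S3 is a server-side COPY plus a DELETE: O(data), not atomic
  static void rename(AmazonS3 s3, String bucket, String src, String dst) {
    s3.copyObject(bucket, src, bucket, dst);  // O(bytes), approx 6-10 MB/s
    s3.deleteObject(bucket, src);             // fail here and both names exist
  }
}
{code}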

What you have seen so far is "demo scale" behaviour over a reused HTTP/1.1 connection against
the same load balancer. You cannot extrapolate from what works there to what offers guaranteed
outcomes in large-scale operations on production data across multiple clusters, except for
the special case "if it doesn't work here, it won't magically work in production".

> Implement Flink's own S3 filesystem
> -----------------------------------
>                 Key: FLINK-5706
>                 URL: https://issues.apache.org/jira/browse/FLINK-5706
>             Project: Flink
>          Issue Type: New Feature
>          Components: filesystem-connector
>            Reporter: Stephan Ewen
> As part of the effort to make Flink completely independent from Hadoop, Flink needs its
own S3 filesystem implementation. Currently Flink relies on Hadoop's S3a and S3n file systems.
> An own S3 file system can be implemented using the AWS SDK. As the basis of the implementation,
the Hadoop File System can be used (Apache Licensed, should be okay to reuse some code as
long as we do a proper attribution).

This message was sent by Atlassian JIRA
