flink-issues mailing list archives

From "Galen Warren (Jira)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem
Date Mon, 03 May 2021 15:19:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338418#comment-17338418 ]

Galen Warren commented on FLINK-19481:

Hi all, I'm the author of the other [PR|https://github.com/apache/flink/pull/15599] that relates
to Google Cloud Storage. [~xintongsong] has been working with me on this.

The main goal of my PR is to add support for the RecoverableWriter interface, so that one
can write to GCS via a StreamingFileSink. The file system support goes through the Hadoop
stack, as noted above, using Google's [cloud storage connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage].
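To make the goal concrete, here is a rough sketch of what writing rows to GCS through a StreamingFileSink looks like once the `gs://` scheme is available. The bucket and path are placeholders, and this assumes the Hadoop-based file system is on Flink's plugin path:

```java
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class GcsSinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source; in practice this would be a real stream
        DataStream<String> stream = env.fromElements("a", "b", "c");

        // Row-format sink writing to a hypothetical GCS bucket; exactly-once
        // delivery of part files depends on RecoverableWriter support
        StreamingFileSink<String> sink = StreamingFileSink
            .forRowFormat(new Path("gs://my-bucket/output"),
                          new SimpleStringEncoder<String>("UTF-8"))
            .build();

        stream.addSink(sink);
        env.execute("gcs-sink-sketch");
    }
}
```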

I have not personally had problems using the GCS connector and the Hadoop stack – it seems
to write checkpoints and savepoints properly. I also use it to write job manager HA data to
GCS, which also seems to work fine.
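For reference, a minimal flink-conf.yaml sketch of that kind of setup might look like the following (the bucket name is hypothetical, and the HA backend itself is configured separately):

```yaml
# Checkpoints and savepoints written through the gs:// file system
state.checkpoints.dir: gs://my-bucket/checkpoints
state.savepoints.dir: gs://my-bucket/savepoints

# JobManager HA metadata stored in GCS
high-availability.storageDir: gs://my-bucket/ha
```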

However, if we do want to support a native implementation in addition to the Hadoop-based
one, we could approach it similarly to what has been done for S3, i.e. have a shared base
project (flink-gs-fs-base?) and then projects for each of the implementations (flink-gs-fs-hadoop
and flink-gs-fs-native?). The recoverable-writer code could go into the shared project so
that both of the implementations could use it (assuming that the native implementation doesn't
already have a recoverable-writer implementation).

I'll defer to the Flink experts on whether that's a worthwhile effort or not. At this point,
from my perspective, it wouldn't be that much work to rework the project structure to support
this.

> Add support for a flink native GCS FileSystem
> ---------------------------------------------
>                 Key: FLINK-19481
>                 URL: https://issues.apache.org/jira/browse/FLINK-19481
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem, FileSystems
>    Affects Versions: 1.12.0
>            Reporter: Ben Augarten
>            Priority: Minor
>              Labels: auto-deprioritized-major
> Currently, GCS is supported, but only by using the Hadoop connector [1].
> The objective of this improvement is to add support for checkpointing to Google Cloud Storage with the Flink File System.
> This would allow the `gs://` scheme to be used for savepointing and checkpointing. Long term, it would be nice if we could use the GCS FileSystem as a source and sink in flink jobs as well.
> Long term, I hope that implementing a flink native GCS FileSystem will simplify usage of GCS, because the hadoop FileSystem ends up bringing in many unshaded dependencies.
> [1] https://github.com/GoogleCloudDataproc/hadoop-connectors

This message was sent by Atlassian Jira
