flink-issues mailing list archives

From "Till Rohrmann (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-9749) Rework Bucketing Sink
Date Thu, 09 Aug 2018 10:11:00 GMT

     [ https://issues.apache.org/jira/browse/FLINK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann updated FLINK-9749:
    Fix Version/s:     (was: 1.6.0)

> Rework Bucketing Sink
> ---------------------
>                 Key: FLINK-9749
>                 URL: https://issues.apache.org/jira/browse/FLINK-9749
>             Project: Flink
>          Issue Type: New Feature
>          Components: Streaming Connectors
>            Reporter: Stephan Ewen
>            Assignee: Kostas Kloudas
>            Priority: Major
>             Fix For: 1.7.0
> The BucketingSink has a series of deficits at the moment.
> Due to the long list of issues, I would suggest adding a new StreamingFileSink with a new and cleaner design.
> h3. Encoders, Parquet, ORC
>  - It only efficiently supports row-wise data formats (Avro, JSON, sequence files).
>  - Efforts to add (columnar) compression for blocks of data are inefficient, because blocks cannot span checkpoints due to persistence-on-checkpoint.
>  - The encoders are part of the {{flink-connector-filesystem}} project, rather than in orthogonal formats projects. This blows up the dependencies of the {{flink-connector-filesystem}} project. As an example, the rolling file sink has dependencies on Hadoop and Avro, which messes up dependency management.
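
The separation of encoders from the sink described above could look roughly like the following. This is only a minimal sketch under assumptions: `RowEncoder`, `LineEncoder`, and `writeAll` are hypothetical names chosen for illustration and are not taken from the issue. The point is that the sink side sees only a small interface, while format-specific implementations (and their dependencies) could live in separate formats projects.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class EncoderSketch {

    /** Hypothetical format-agnostic contract; the sink would depend only on this. */
    public interface RowEncoder<T> {
        void encode(T element, OutputStream out) throws IOException;
    }

    /** Example implementation that could live in a separate formats module. */
    public static final class LineEncoder implements RowEncoder<String> {
        @Override
        public void encode(String element, OutputStream out) throws IOException {
            out.write(element.getBytes(StandardCharsets.UTF_8));
            out.write('\n');
        }
    }

    /** Stand-in for the sink: writes elements without knowing the concrete format. */
    public static <T> byte[] writeAll(Iterable<T> elements, RowEncoder<T> encoder)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (T element : elements) {
            encoder.encode(element, out);
        }
        return out.toByteArray();
    }
}
```

With a contract of this shape, an Avro or Parquet encoder would pull its dependencies into its own module rather than into the sink.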
> h3. Use of FileSystems
>  - The BucketingSink works only on Hadoop's FileSystem abstraction. It does not support Flink's own FileSystem abstraction and cannot work with the packaged S3, maprfs, and swift file systems.
>  - The sink hence needs Hadoop as a dependency.
>  - The sink relies on "trying out" whether truncation works, which requires write access to the user's working directory.
>  - The sink relies on enumerating and counting files, rather than maintaining its own state, making it less efficient.
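
To illustrate the last point, here is a minimal sketch of a sink keeping its own part-file counters instead of listing a directory to find the next free index. All names (`PartCounterState`, `nextPartFile`, the `part-N` naming) are hypothetical and not taken from the issue; the map stands in for whatever state would be snapshotted on checkpoint.

```java
import java.util.HashMap;
import java.util.Map;

public class PartCounterState {

    // bucket id -> number of part files handed out so far (hypothetical sink state)
    private final Map<String, Long> partsPerBucket = new HashMap<>();

    /** Returns the next part file name for a bucket without touching the file system. */
    public String nextPartFile(String bucketId) {
        long index = partsPerBucket.merge(bucketId, 1L, Long::sum) - 1;
        return bucketId + "/part-" + index;
    }

    /** Copy of the counters, as it might be stored in a checkpoint. */
    public Map<String, Long> snapshot() {
        return new HashMap<>(partsPerBucket);
    }

    /** Rebuilds the counters from a snapshot, so no file enumeration is needed on recovery. */
    public static PartCounterState restore(Map<String, Long> snapshot) {
        PartCounterState state = new PartCounterState();
        state.partsPerBucket.putAll(snapshot);
        return state;
    }
}
```

Because the counters come from state rather than from a directory listing, this approach would also not depend on the listing being strongly consistent.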
> h3. Correctness and Efficiency on S3
>  - The BucketingSink relies on strong consistency in the file enumeration, hence may work incorrectly on S3.
>  - The BucketingSink relies on persisting streams at intermediate points. This does not work properly on S3, hence there may be data loss on S3.
> h3. .valid-length companion file
>  - The valid-length file makes things hard for consumers of the data and should be dropped.
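
To show why a companion file is a burden, here is a rough sketch of what every consumer would have to do before reading a part file. It rests on assumptions that are not stated in the issue: the companion is taken to sit next to the data file under the hypothetical name `<file>.valid-length` and to hold the valid byte count as decimal text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class ValidLengthReader {

    /** Returns only the bytes of dataFile that the (assumed) companion file declares valid. */
    public static byte[] readValidBytes(Path dataFile) throws IOException {
        byte[] all = Files.readAllBytes(dataFile);

        // Assumed naming convention: "<file>.valid-length" next to the data file.
        Path companion = dataFile.resolveSibling(dataFile.getFileName() + ".valid-length");
        if (!Files.exists(companion)) {
            return all; // no companion: the whole file is considered valid
        }

        // Assumed content: the valid length as decimal text.
        String text = new String(Files.readAllBytes(companion), StandardCharsets.UTF_8).trim();
        long validLength = Long.parseLong(text);

        // Ignore everything past the valid length.
        return Arrays.copyOf(all, (int) Math.min(validLength, all.length));
    }
}
```

A reader that is unaware of this convention would silently consume the invalid trailing bytes as well.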
> We track this design in a series of sub issues.

This message was sent by Atlassian JIRA
