beam-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (BEAM-92) Data-dependent sinks
Date Wed, 14 Jun 2017 04:06:00 GMT


ASF GitHub Bot commented on BEAM-92:

GitHub user reuvenlax opened a pull request:

    [BEAM-92] Allow value-dependent files in FileBasedSink

    This is modeled off of the pattern used in BigQueryIO. The user can provide a DynamicDestinations
class which can map the input into a user-defined destination type, and the destination into
a FilenamePolicy. Some refactoring of FilenamePolicy is done as well to make this more useful
- e.g. allowing FilenamePolicy to pick its own base directory instead of having it passed
in from the sink (this is marked as @Experimental, so such changes are allowed).
    Not yet in this PR: we should allow the sinks to take user-defined types for mapping.
We should also provide a convenience method for dynamic output using the DefaultFilenamePolicy
(e.g. passing in KVs).
    R: @jkff 

You can merge this pull request into a Git repository by running:

    $ git pull dynamic_file_based_sink

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3356
commit 531a013f6f1e6865566f648d3486d77a7de20369
Author: Reuven Lax <>
Date:   2017-06-10T00:11:32Z

    Add DynamicDestinations support to FileBasedSink

commit 616f9ede5afa7e771f50595bcd290b907744d57f
Author: Reuven Lax <>
Date:   2017-06-10T18:23:32Z

    More fixups.

commit c6855fa130c86c94de74c36976a667de7f8d6219
Author: Reuven Lax <>
Date:   2017-06-12T16:42:26Z

    Fix some tests.

commit f4d4bf5f9238f7e9627363daa14158cdcc68d64b
Author: Reuven Lax <>
Date:   2017-06-13T21:34:08Z

    Remove baseDirectory parameter from FilenamePolicy. The FilenamePolicy can choose it's
own base directory.

commit 219a3d649428f673bf35f043990d61000d5fd64e
Author: Reuven Lax <>
Date:   2017-06-14T00:06:01Z

    No longer pass in "extension" to FilenamePolicy. Instead we pass in file metadata class,
including a getSuggestedExtension method that is based on the compression type.

commit 1c5ed9e39e891b4418f597d531f442806b4f22b8
Author: Reuven Lax <>
Date:   2017-06-14T00:41:22Z

    Add a withTempDirectory override to TextIO and AvroIO. This way users aren't forced to
provide a dummy file prefix, just to specify a temp directory.

commit 14bf19544801ce0791ffd2f95574ba9b5b9d33c6
Author: Reuven Lax <>
Date:   2017-06-14T00:49:34Z

    Fix validation code.

commit 93727530ac83e70570f6da1214fa20c5871b97e6
Author: Reuven Lax <>
Date:   2017-06-14T00:52:08Z

    Fix fix javadoc.

commit da6b0e648fac9d9def26c37aa755b73a5a3f9cef
Author: Reuven Lax <>
Date:   2017-06-14T02:29:00Z

    Fix CheckStyle violations.

commit d2624f729cb7f7dc49f9dda65f6db7744f21e3c6
Author: Reuven Lax <>
Date:   2017-06-14T04:01:26Z

    Fix some failures.


> Data-dependent sinks
> --------------------
>                 Key: BEAM-92
>                 URL:
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core
>            Reporter: Eugene Kirpichov
>            Assignee: Reuven Lax
> Current sink API writes all data to a single destination, but there are many use cases
where different pieces of data need to be routed to different destinations where the set of
destinations is data-dependent (so can't be implemented with a Partition transform).
> One internally discussed proposal was an API of the form:
> {code}
> PCollection<Void> PCollection<T>.apply(
>     Write.using(DoFn<T, SinkT> where,
>                 MapFn<SinkT, WriteOperation<WriteResultT, T>> how)
> {code}
> so an item T gets written to a destination (or multiple destinations) determined by "where";
and the writing strategy is determined by "how" that produces a WriteOperation (current API
- global init/write/global finalize hooks) for any given destination.
> This API also has other benefits:
> * allows the SinkT to be computed dynamically (in "where"), rather than specified at
pipeline construction time
> * removes the necessity for a Sink class entirely
> * is sequenceable w.r.t. downstream transforms (you can stick transforms onto the returned
PCollection<Void>, while the current returns a PDone)

This message was sent by Atlassian JIRA

View raw message