beam-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
Subject [beam] branch master updated: Add a small announcement for Splittable DoFn.
Date Thu, 10 Dec 2020 21:49:53 GMT
This is an automated email from the ASF dual-hosted git repository.

boyuanz pushed a commit to branch master
in repository

The following commit(s) were added to refs/heads/master by this push:
     new 379d796  Add a small announcement for Splittable DoFn.
     new c4af7f9  Merge pull request #13456 from [BEAM-10480] Add a small announcement for
Splittable DoFn.
379d796 is described below

commit 379d7967444db0b7f5ab932fe331f9e0dea83d63
Author: Boyuan Zhang <>
AuthorDate: Tue Dec 1 17:42:26 2020 -0800

    Add a small announcement for Splittable DoFn.
 .../en/blog/       | 91 ++++++++++++++++++++++
 1 file changed, 91 insertions(+)

diff --git a/website/www/site/content/en/blog/ b/website/www/site/content/en/blog/
new file mode 100644
index 0000000..5e73a8c
--- /dev/null
+++ b/website/www/site/content/en/blog/
@@ -0,0 +1,91 @@
+title:  "Splittable DoFn in Apache Beam is Ready to Use"
+date:   2020-12-14 00:00:01 -0800
+  - blog
+  - /blog/2020/12/14/splittable-do-fn-is-available.html
+  - boyuanzz
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+See the License for the specific language governing permissions and
+limitations under the License.
+We are pleased to announce that Splittable DoFn (SDF) is ready for use in the Beam Python,
+and Go SDKs for versions 2.25.0 and later.
+In 2017, [Splittable DoFn Blog Post]( proposed
+to build [Splittable DoFn]( APIs as the new recommended
way of
+building I/O connectors. Splittable DoFn is a generalization of `DoFn` that gives it the
+capabilities of `Source` while retaining `DoFn`'s syntax, flexibility, modularity, and ease
+coding. Thus, it becomes much easier to develop complex I/O connectors with simpler and reusable
+SDF has three advantages over the existing `UnboundedSource` and `BoundedSource`:
+* SDF provides a unified set of APIs to handle both unbounded and bounded cases.
+* SDF enables reading from source descriptors dynamically.
+  - Taking KafkaIO as an example, within `UnboundedSource`/`BoundedSource` API, you must
+  the topic and partition you want to read from during pipeline construction time. There
is no way
+  for `UnboundedSource`/`BoundedSource` to accept topics and partitions as inputs during
+  time. But it's built-in to SDF.
+* SDF fits in as any node on a pipeline freely with the ability of splitting.
+  - `UnboundedSource`/`BoundedSource` has to be the root node of the pipeline to gain performance
+  benefits from splitting strategies, which limits many real-world usages. This is no longer
a limit
+  for an SDF.
+As SDF is now ready to use with all the mentioned improvements, it is the recommended
+way to build the new I/O connectors. Try out building your own Splittable DoFn by following
+[programming guide](
+have provided tonnes of common utility classes such as common types of `RestrictionTracker`
+`WatermarkEstimator` in Beam SDK, which will help you onboard easily. As for the existing
+connectors, we have wrapped `UnboundedSource` and `BoundedSource` implementations into Splittable
+DoFns, yet we still encourage developers to convert `UnboundedSource`/`BoundedSource` into
+Splittable DoFn implementation to gain more performance benefits.
+Many thanks to every contributor who brought this highly anticipated design into the data
+world. We are really excited to see that users benefit from SDF.
+Below are some real-world SDF examples for you to explore.
+## Real world Splittable DoFn examples
+**Java Examples**
+* [Kafka](
+An I/O connector for [Apache Kafka](
+(an open-source distributed event streaming platform).
+* [Watch](
+Uses a polling function producing a growing set of outputs for each input until a per-input
+termination condition is met.
+* [Parquet](
+An I/O connector for [Apache Parquet](
+(an open-source columnar storage format).
+* [HL7v2](
+An I/O connector for HL7v2 messages (a clinical messaging format that provides data about
+that occur inside an organization) part of
+[Google’s Cloud Healthcare API](
+* [BoundedSource wrapper](
+A wrapper which converts an existing [BoundedSource](
+implementation to a splittable DoFn.
+* [UnboundedSource wrapper](
+A wrapper which converts an existing [UnboundedSource](
+implementation to a splittable DoFn.
+**Python Examples**
+* [BoundedSourceWrapper](
+A wrapper which converts an existing [BoundedSource](
+implementation to a splittable DoFn.
+**Go Examples**
+ *  [textio.ReadSdf](
implements reading from text files using a splittable DoFn.

View raw message