beam-commits mailing list archives

From boyu...@apache.org
Subject [beam] branch master updated: Add splittable dofn as the recommended way of building connectors.
Date Tue, 01 Dec 2020 18:33:55 GMT
This is an automated email from the ASF dual-hosted git repository.

boyuanz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git


The following commit(s) were added to refs/heads/master by this push:
     new 4c51569  Add splittable dofn as the recommended way of building connectors.
     new 3de140f  Merge pull request #13227 from [BEAM-10480] Add splittable dofn as the recommended way of building connectors.
4c51569 is described below

commit 4c51569d3b972e2271efcc520a48ccb1bd20c9be
Author: Boyuan Zhang <boyuanz@google.com>
AuthorDate: Thu Oct 29 15:13:35 2020 -0700

    Add splittable dofn as the recommended way of building connectors.
---
 .../en/documentation/io/developing-io-java.md      |  3 +
 .../en/documentation/io/developing-io-overview.md  | 80 +++++++++++++---------
 .../en/documentation/io/developing-io-python.md    |  3 +
 3 files changed, 53 insertions(+), 33 deletions(-)

diff --git a/website/www/site/content/en/documentation/io/developing-io-java.md b/website/www/site/content/en/documentation/io/developing-io-java.md
index 7de2024..7836a3c 100644
--- a/website/www/site/content/en/documentation/io/developing-io-java.md
+++ b/website/www/site/content/en/documentation/io/developing-io-java.md
@@ -17,6 +17,9 @@ limitations under the License.
 -->
 # Developing I/O connectors for Java
 
+**IMPORTANT:** Use ``Splittable DoFn`` to develop your new I/O. For more details, read the
+[new I/O connector overview](/documentation/io/developing-io-overview/).
+
 To connect to a data store that isn’t supported by Beam’s existing I/O
 connectors, you must create a custom I/O connector that usually consist of a
 source and a sink. All Beam sources and sinks are composite transforms; however,
diff --git a/website/www/site/content/en/documentation/io/developing-io-overview.md b/website/www/site/content/en/documentation/io/developing-io-overview.md
index 0ea507f..c8e0482 100644
--- a/website/www/site/content/en/documentation/io/developing-io-overview.md
+++ b/website/www/site/content/en/documentation/io/developing-io-overview.md
@@ -46,33 +46,32 @@ are the recommended steps to get started:
 For **bounded (batch) sources**, there are currently two options for creating a
 Beam source:
 
+1. Use `Splittable DoFn`.
+
 1. Use `ParDo` and `GroupByKey`.
 
-1. Use the `Source` interface and extend the `BoundedSource` abstract subclass.
 
-`ParDo` is the recommended option, as implementing a `Source` can be tricky. See
-[When to use the Source interface](#when-to-use-source) for a list of some use
-cases where you might want to use a `Source` (such as
-[dynamic work rebalancing](/blog/2016/05/18/splitAtFraction-method.html)).
+`Splittable DoFn` is the recommended option, as it's the most recent source framework for both
+bounded and unbounded sources. This is meant to replace the `Source` APIs(
+[BoundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/BoundedSource.html) and
+[UnboundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/UnboundedSource.html))
+in the new system. Read
+[Splittable DoFn Programming Guide](/learn/programming-guide/#splittable-dofns) for how to write one
+Splittable DoFn. For more information, see the
+[roadmap for multi-SDK connector efforts](/roadmap/connectors-multi-sdk/).
 
-(Java only) For **unbounded (streaming) sources**, you must use the `Source`
-interface and extend the `UnboundedSource` abstract subclass. `UnboundedSource`
-supports features that are useful for streaming pipelines, such as
-checkpointing.
+For Java and Python **unbounded (streaming) sources**, you must use the `Splittable DoFn`, which
+supports features that are useful for streaming pipelines, including checkpointing, controlling
+watermark, and tracking backlog.
 
-Splittable DoFn is a new sources framework that is under development and will
-replace the other options for developing bounded and unbounded sources. For more
-information, see the
-[roadmap for multi-SDK connector efforts](/roadmap/connectors-multi-sdk/).
 
-### When to use the Source interface {#when-to-use-source}
+### When to use the Splittable DoFn interface {#when-to-use-splittable-dofn}
 
-If you are not sure whether to use `Source`, feel free to email the [Beam dev
-mailing list](/get-started/support) and we can discuss the
-specific pros and cons of your case.
+If you are not sure whether to use `Splittable DoFn`, feel free to email the
+[Beam dev mailing list](/get-started/support) and we can discuss the specific pros and cons of your
+case.
 
-In some cases, implementing a `Source` might be necessary or result in better
-performance:
+In some cases, implementing a `Splittable DoFn` might be necessary or result in better performance:
 
 * **Unbounded sources:** `ParDo` does not work for reading from unbounded
   sources.  `ParDo` does not support checkpointing or mechanisms like de-duping
@@ -90,22 +89,40 @@ performance:
   jobs. Depending on your data source, dynamic work rebalancing might not be
   possible.
 
-* **Splitting into parts of particular size recommended by the runner:** `ParDo`
-  does not receive `desired_bundle_size` as a hint from runners when performing
-  initial splitting.
+* **Splitting initially to increase parallelism:** `ParDo`
+  does not have the ability to perform initial splitting.
 
 For example, if you'd like to read from a new file format that contains many
 records per file, or if you'd like to read from a key-value store that supports
 read operations in sorted key order.
 
-### Source lifecycle {#source}
-Here is a sequence diagram that shows the lifecycle of the Source during
- the execution of the Read transform of an IO. The comments give useful
- information to IO developers such as the constraints that
- apply to the objects or particular cases such as streaming mode.
-
- <!-- The source for the sequence diagram can be found in the the SVG resource. -->
-![This is a sequence diagram that shows the lifecycle of the Source](/images/source-sequence-diagram.svg)
+### I/O examples using SDFs
+**Java Examples**
+
+* [Kafka](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ReadFromKafkaDoFn.java#L118):
+An I/O connector for [Apache Kafka](https://kafka.apache.org/)
+(an open-source distributed event streaming platform).
+* [Watch](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L787):
+Uses a polling function producing a growing set of outputs for each input until a per-input
+termination condition is met.
+* [Parquet](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L365):
+An I/O connector for [Apache Parquet](https://parquet.apache.org/)
+(an open-source columnar storage format).
+* [HL7v2](https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/healthcare/HL7v2IO.java#L493):
+An I/O connector for HL7v2 messages (a clinical messaging format that provides data about events
+that occur inside an organization) part of
+[Google’s Cloud Healthcare API](https://cloud.google.com/healthcare).
+* [BoundedSource wrapper](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L248):
+A wrapper which converts an existing [BoundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/BoundedSource.html)
+implementation to a splittable DoFn.
+* [UnboundedSource wrapper](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L432):
+A wrapper which converts an existing [UnboundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/UnboundedSource.html)
+implementation to a splittable DoFn.
+
+**Python Examples**
+* [BoundedSourceWrapper](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/python/apache_beam/io/iobase.py#L1375):
+A wrapper which converts an existing [BoundedSource](https://beam.apache.org/releases/pydoc/current/apache_beam.io.iobase.html#apache_beam.io.iobase.BoundedSource)
+implementation to a splittable DoFn.
 
 ### Using ParDo and GroupByKey
 
@@ -157,7 +174,6 @@ example:
     cannot be parallelized. In this case, the `ParDo` would open the file and
     read in sequence, producing a `PCollection` of records from the file.
 
-
 ## Sinks
 
 To create a Beam sink, we recommend that you use a `ParDo` that writes the
@@ -169,8 +185,6 @@ For **file-based sinks**, you can use the `FileBasedSink` abstraction that is
 provided by both the Java and Python SDKs. See our language specific
 implementation guides for more details:
 
-* [Developing I/O connectors for Java](/documentation/io/developing-io-java/)
-* [Developing I/O connectors for Python](/documentation/io/developing-io-python/)
 
 
 
diff --git a/website/www/site/content/en/documentation/io/developing-io-python.md b/website/www/site/content/en/documentation/io/developing-io-python.md
index 039b633..7c7705b 100644
--- a/website/www/site/content/en/documentation/io/developing-io-python.md
+++ b/website/www/site/content/en/documentation/io/developing-io-python.md
@@ -19,6 +19,9 @@ limitations under the License.
 -->
 # Developing I/O connectors for Python
 
+**IMPORTANT:** Please use ``Splittable DoFn`` to develop your new I/O. For more details, please read
+the [new I/O connector overview](/documentation/io/developing-io-overview/).
+
 To connect to a data store that isn’t supported by Beam’s existing I/O
 connectors, you must create a custom I/O connector that usually consist of a
 source and a sink. All Beam sources and sinks are composite transforms; however,
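
The updated overview above names initial splitting and checkpointing as capabilities that `ParDo` lacks. As a rough, Beam-independent illustration of those two ideas, here is a minimal sketch in plain Python. It does not use the Beam SDK: `OffsetRestriction`, `initial_split`, and `try_split` are hypothetical names invented for this example, and the real Splittable DoFn API is described in the programming guide linked in the diff.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class OffsetRestriction:
    """Half-open range [start, stop) of offsets to read (hypothetical type)."""
    start: int
    stop: int


def initial_split(r: OffsetRestriction, desired_size: int) -> List[OffsetRestriction]:
    """Cut a restriction into chunks of at most desired_size offsets each."""
    if desired_size <= 0:
        raise ValueError("desired_size must be positive")
    return [
        OffsetRestriction(s, min(s + desired_size, r.stop))
        for s in range(r.start, r.stop, desired_size)
    ]


def try_split(
    r: OffsetRestriction, last_claimed: int, fraction: float
) -> Optional[Tuple[OffsetRestriction, OffsetRestriction]]:
    """Split the unprocessed remainder of r at the given fraction.

    last_claimed is the last offset already processed (start - 1 if none).
    Returns (primary, residual), or None if nothing is left to hand off.
    A fraction of 0.0 behaves like a checkpoint: the primary ends at the
    next unclaimed offset and the residual holds all remaining work.
    """
    next_unclaimed = max(last_claimed + 1, r.start)
    remaining = r.stop - next_unclaimed
    split_at = next_unclaimed + int(remaining * fraction)
    if split_at >= r.stop:
        return None
    return OffsetRestriction(r.start, split_at), OffsetRestriction(split_at, r.stop)


if __name__ == "__main__":
    print(initial_split(OffsetRestriction(0, 10), 4))
    print(try_split(OffsetRestriction(0, 100), last_claimed=49, fraction=0.5))
```

In this sketch, the initial split is what would let a runner read chunks in parallel, and the residual returned by `try_split` is the work that could be resumed later or handed to another worker.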

