beam-commits mailing list archives

From al...@apache.org
Subject [1/3] beam-site git commit: Move extension README files to website
Date Wed, 10 May 2017 04:25:41 GMT
Repository: beam-site
Updated Branches:
  refs/heads/asf-site 47ad18557 -> acd643afb


Move extension README files to website


Project: http://git-wip-us.apache.org/repos/asf/beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam-site/commit/8ed3f247
Tree: http://git-wip-us.apache.org/repos/asf/beam-site/tree/8ed3f247
Diff: http://git-wip-us.apache.org/repos/asf/beam-site/diff/8ed3f247

Branch: refs/heads/asf-site
Commit: 8ed3f247b8600b7651cffff5186ba145473e97ed
Parents: 47ad185
Author: Ahmet Altay <altay@google.com>
Authored: Tue May 9 17:47:01 2017 -0700
Committer: Ahmet Altay <altay@google.com>
Committed: Tue May 9 21:21:25 2017 -0700

----------------------------------------------------------------------
 src/documentation/runners/flink.md        |  2 +-
 src/documentation/sdks/java-extensions.md | 59 ++++++++++++++++++++++++++
 src/documentation/sdks/java.md            |  8 ++++
 3 files changed, 68 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/beam-site/blob/8ed3f247/src/documentation/runners/flink.md
----------------------------------------------------------------------
diff --git a/src/documentation/runners/flink.md b/src/documentation/runners/flink.md
index f844c6b..685917e 100644
--- a/src/documentation/runners/flink.md
+++ b/src/documentation/runners/flink.md
@@ -138,7 +138,7 @@ When executing your pipeline with the Flink Runner, you can set these pipeline o
 </tr>
 </table>
 
-See the reference documentation for the  <span class="language-java">[FlinkPipelineOptions]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/flink/FlinkPipelineOptions.html)</span><span class="language-py">[PipelineOptions](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/pipeline_options.py)</span> interface (and its subinterfaces) for the complete list of pipeline configuration options.
+See the reference documentation for the  <span class="language-java">[FlinkPipelineOptions]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/flink/FlinkPipelineOptions.html)</span><span class="language-py">[PipelineOptions](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/options/pipeline_options.py)</span> interface (and its subinterfaces) for the complete list of pipeline configuration options.
 
 ## Additional information and caveats
 

http://git-wip-us.apache.org/repos/asf/beam-site/blob/8ed3f247/src/documentation/sdks/java-extensions.md
----------------------------------------------------------------------
diff --git a/src/documentation/sdks/java-extensions.md b/src/documentation/sdks/java-extensions.md
new file mode 100644
index 0000000..a4694af
--- /dev/null
+++ b/src/documentation/sdks/java-extensions.md
@@ -0,0 +1,59 @@
+---
+layout: default
+title: "Beam Java SDK Extensions"
+permalink: /documentation/sdks/java-extensions/
+---
+# Apache Beam Java SDK Extensions
+
+## <a name="join-library"></a>Join-library
+
+Join-library provides inner join, outer left join, and outer right join functions. The aim
+is to simplify the most common cases of join to a simple function call.
+
+The functions are generic and support joins of any Beam-supported types.
+Inputs to the join functions are `PCollection`s of `Key` / `Value` pairs. Both
+the left and right `PCollection`s need the same type for the key. All the join
+functions return a `Key` / `Value` pair where `Key` is the join key and `Value` is
+a `Key` / `Value` pair whose key is the left value and whose value is the right value.
+
+For outer joins, the user must provide a value that represents `null` because `null`
+cannot be serialized.
+
+Example usage:
+
+```
+PCollection<KV<String, String>> leftPcollection = ...
+PCollection<KV<String, Long>> rightPcollection = ...
+
+PCollection<KV<String, KV<String, Long>>> joinedPcollection =
+  Join.innerJoin(leftPcollection, rightPcollection);
+```
+
+
+## <a name="sorter"></a>Sorter
+
+This module provides the `SortValues` transform, which takes a `PCollection<KV<K, Iterable<KV<K2, V>>>>` and produces a `PCollection<KV<K, Iterable<KV<K2, V>>>>` where, for each primary key `K`, the paired `Iterable<KV<K2, V>>` has been sorted by the byte encoding of the secondary key (`K2`). It is an efficient and scalable sorter for iterables, even if they are large (do not fit in memory).
+
+### Caveats
+
+* This transform performs value-only sorting; the iterable accompanying each key is sorted, but *there is no relationship between different keys*, as Beam does not support any defined relationship between different elements in a `PCollection`.
+* Each `Iterable<KV<K2, V>>` is sorted on a single worker using local memory and disk. This means that `SortValues` may be a performance and/or scalability bottleneck when used in different pipelines. For example, users are discouraged from using `SortValues` on a `PCollection` of a single element to globally sort a large `PCollection`. A (rough) estimate of the number of bytes of disk space utilized if sorting spills to disk is `numRecords * (numSecondaryKeyBytesPerRecord + numValueBytesPerRecord + 16) * 3`.
+
+### Options
+
+* The user can customize the temporary location used if sorting requires spilling to disk and the maximum amount of memory to use by creating a custom instance of `BufferedExternalSorter.Options` to pass into `SortValues.create`.
+
+### Example usage of `SortValues`
+
+```
+PCollection<KV<String, KV<String, Integer>>> input = ...
+
+// Group by primary key, bringing <SecondaryKey, Value> pairs for the same key together.
+PCollection<KV<String, Iterable<KV<String, Integer>>>> grouped =
+    input.apply(GroupByKey.<String, KV<String, Integer>>create());
+
+// For every primary key, sort the iterable of <SecondaryKey, Value> pairs by secondary key.
+PCollection<KV<String, Iterable<KV<String, Integer>>>> groupedAndSorted =
+    grouped.apply(
+        SortValues.<String, String, Integer>create(new BufferedExternalSorter.Options()));
+```

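The join-library text in the diff above says that outer joins need a caller-supplied value standing in for `null`. The following is a minimal plain-Java sketch of that idea over in-memory lists. It does not use Beam or the join-library API; the class `LeftOuterJoinSketch`, the method `leftOuterJoin`, the sample data, and the `-1L` placeholder are all hypothetical and chosen only for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: plain Java, not the Beam join-library API.
class LeftOuterJoinSketch {

    // Pairs every left entry with each right entry that shares its key.
    // A left entry with no match is paired with the caller-supplied placeholder,
    // which plays the role of the "value that represents null".
    static <K, V1, V2> List<Map.Entry<K, Map.Entry<V1, V2>>> leftOuterJoin(
            List<Map.Entry<K, V1>> left, List<Map.Entry<K, V2>> right, V2 placeholder) {
        List<Map.Entry<K, Map.Entry<V1, V2>>> joined = new ArrayList<>();
        for (Map.Entry<K, V1> l : left) {
            boolean matched = false;
            for (Map.Entry<K, V2> r : right) {
                if (l.getKey().equals(r.getKey())) {
                    Map.Entry<V1, V2> pair = new SimpleEntry<>(l.getValue(), r.getValue());
                    joined.add(new SimpleEntry<>(l.getKey(), pair));
                    matched = true;
                }
            }
            if (!matched) {
                Map.Entry<V1, V2> pair = new SimpleEntry<>(l.getValue(), placeholder);
                joined.add(new SimpleEntry<>(l.getKey(), pair));
            }
        }
        return joined;
    }

    // Sample data (assumed): key "b" has no right-hand match.
    static List<Map.Entry<String, Map.Entry<String, Long>>> demo() {
        List<Map.Entry<String, String>> left = new ArrayList<>();
        left.add(new SimpleEntry<>("a", "apple"));
        left.add(new SimpleEntry<>("b", "banana"));

        List<Map.Entry<String, Long>> right = new ArrayList<>();
        right.add(new SimpleEntry<>("a", 3L));

        // -1L is the placeholder for "no right value".
        return leftOuterJoin(left, right, -1L);
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Map.Entry<String, Long>> row : demo()) {
            System.out.println(row.getKey() + " -> left=" + row.getValue().getKey()
                    + ", right=" + row.getValue().getValue());
        }
    }
}
```

The placeholder should be a value that cannot occur in real data, so that downstream code can tell an unmatched row from a genuine match.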
http://git-wip-us.apache.org/repos/asf/beam-site/blob/8ed3f247/src/documentation/sdks/java.md
----------------------------------------------------------------------
diff --git a/src/documentation/sdks/java.md b/src/documentation/sdks/java.md
index 3275914..7cfe96f 100644
--- a/src/documentation/sdks/java.md
+++ b/src/documentation/sdks/java.md
@@ -23,3 +23,11 @@ The Java SDK supports all features currently supported by the Beam model.
 
 ## Pipeline I/O
 See the [Beam-provided I/O Transforms]({{site.baseurl }}/documentation/io/built-in/) page for a list of the currently available I/O transforms.
+
+
+## Extensions
+
+The Java SDK has the following extensions:
+
+- [join-library]({{site.baseurl}}/documentation/sdks/java-extensions/#join-library) provides inner join, outer left join, and outer right join functions.
+- [sorter]({{site.baseurl}}/documentation/sdks/java-extensions/#sorter) is an efficient and scalable sorter for large iterables.

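The Sorter section in the diff states that values are sorted by the byte encoding of the secondary key, which can differ from the key type's natural ordering. The sketch below illustrates this in plain Java, assuming the keys are strings encoded as UTF-8 and that encoded bytes are compared lexicographically as unsigned values. Both are assumptions made for illustration; the actual bytes depend on the coder used for the key. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only: plain Java, not Beam's SortValues.
class ByteOrderSortSketch {

    // Compares two byte arrays lexicographically, treating each byte as unsigned.
    static int compareUnsignedBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        // A shorter array that is a prefix of the longer one sorts first.
        return Integer.compare(a.length, b.length);
    }

    // Returns a copy of the keys sorted by their UTF-8 bytes (assumed encoding).
    static List<String> sortByEncodedBytes(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator.comparing(
                (String k) -> k.getBytes(StandardCharsets.UTF_8),
                ByteOrderSortSketch::compareUnsignedBytes));
        return sorted;
    }

    public static void main(String[] args) {
        // "10" sorts before "9" and "B" before "a" because bytes are compared, not numbers or letters.
        System.out.println(sortByEncodedBytes(Arrays.asList("9", "10", "b", "B", "a")));
        // prints [10, 9, B, a, b]
    }
}
```

If a numeric or case-insensitive order is needed, one option is to choose a secondary key whose encoded bytes sort the same way as the intended order, for example zero-padded strings.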

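The Sorter caveats give a rough disk-usage estimate of `numRecords * (numSecondaryKeyBytesPerRecord + numValueBytesPerRecord + 16) * 3` bytes when sorting spills to disk. As a worked example, the sketch below evaluates that formula in plain Java. The class name, method name, and sample sizes are hypothetical; only the formula comes from the text.

```java
// Hypothetical helper that evaluates the rough spill estimate quoted in the Sorter caveats.
class SpillEstimateSketch {

    // numRecords * (secondaryKeyBytes + valueBytes + 16) * 3, with overflow checks.
    static long estimateSpillBytes(long numRecords, long secondaryKeyBytes, long valueBytes) {
        long perRecord = secondaryKeyBytes + valueBytes + 16;
        return Math.multiplyExact(Math.multiplyExact(numRecords, perRecord), 3L);
    }

    public static void main(String[] args) {
        // Assumed sample: one million records, 8-byte secondary keys, 32-byte values.
        System.out.println(estimateSpillBytes(1_000_000L, 8L, 32L));
        // prints 168000000
    }
}
```

For that sample the estimate is about 168 MB of temporary disk on the single worker that sorts the iterable.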