flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From aljoscha <...@git.apache.org>
Subject [GitHub] flink pull request #3191: [FLINK-5529] [FLINK-4752] [docs] Improve / extends...
Date Wed, 25 Jan 2017 17:32:12 GMT
Github user aljoscha commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3191#discussion_r97824676
  
    --- Diff: docs/dev/windows.md ---
    @@ -23,133 +23,96 @@ specific language governing permissions and limitations
     under the License.
     -->
     
    -Flink uses a concept called *windows* to divide a (potentially) infinite `DataStream`
into finite
    -slices based on the timestamps of elements or other criteria. This division is required
when working
    -with infinite streams of data and performing transformations that aggregate elements.
    -
    -<span class="label label-info">Info</span> We will mostly talk about *keyed
windowing* here, i.e.
    -windows that are applied on a `KeyedStream`. Keyed windows have the advantage that elements
are
    -subdivided based on both window and key before being given to
    -a user function. The work can thus be distributed across the cluster
    -because the elements for different keys can be processed independently. If you absolutely
have to,
    -you can check out [non-keyed windowing](#non-keyed-windowing) where we describe how non-keyed
    -windows work.
    +Windows are at the heart of processing infinite streams. Windows split the stream into
"buckets" of finite size, 
    +over which we can apply computations. This document focuses on how windowing is performed
in Flink and how the 
    +programmer can benefit to the maximum from its offered functionality. 
     
    -* This will be replaced by the TOC
    -{:toc}
    +The general structure of a windowed Flink program is presented below. This is also going
to serve as a roadmap for 
    +the rest of the page.
     
    -## Basics
    +    stream
    +           .keyBy(...)          <-  keyed versus non-keyed windows
    +           .window(...)         <-  required: "assigner"
    +          [.trigger(...)]       <-  optional: "trigger" (else default trigger)
    +          [.evictor(...)]       <-  optional: "evictor" (else no evictor)
    +          [.allowedLateness()]  <-  optional, else zero
    +           .reduce/fold/apply() <-  required: "function"
     
    -For a windowed transformation you must at least specify a *key*
    -(see [specifying keys]({{ site.baseurl }}/dev/api_concepts.html#specifying-keys)),
    -a *window assigner* and a *window function*. The *key* divides the infinite, non-keyed,
stream
    -into logical keyed streams while the *window assigner* assigns elements to finite per-key
windows.
    -Finally, the *window function* is used to process the elements of each window.
    +In the above, the commands in square brackets ([...]) are optional. This reveals that
Flink allows you to customize your 
    +windowing logic in many different ways so that it best fits your needs. 
     
    -The basic structure of a windowed transformation is thus as follows:
    +* This will be replaced by the TOC
    +{:toc}
     
    -<div class="codetabs" markdown="1">
    -<div data-lang="java" markdown="1">
    -{% highlight java %}
    -DataStream<T> input = ...;
    +## Window Lifecycle
     
    -input
    -    .keyBy(<key selector>)
    -    .window(<window assigner>)
    -    .<windowed transformation>(<window function>);
    -{% endhighlight %}
    -</div>
    +In a nutshell, a window is **created** as soon as the first element that should belong
to this window arrives, and the  
    +window is **completely removed** when the time (event or processing time) passes its
end timestamp plus the user-specified 
    +`allowed lateness` (see [Allowed Lateness](#allowed-lateness)). Flink guarantees removal
only for time-based 
    +windows and not for other types, *e.g.* global windows (see [Window Assigners](#window-assigners)).
For example, with an 
    +event-time-based windowing strategy that creates non-overlapping (or tumbling) windows
every 5 minutes and has an allowed 
    +lateness of 1 min, Flink will create a new window for the interval between `12:00` and
`12:05` when the first element with 
    +a timestamp that falls into this interval arrives, and it will remove it when the watermark
passes the `12:06`
    +timestamp. 
     
    -<div data-lang="scala" markdown="1">
    -{% highlight scala %}
    -val input: DataStream[T] = ...
    +In addition, each window will have a `Trigger` (see [Triggers](#triggers)) and a function
(`WindowFunction`, `ReduceFunction` or 
    +`FoldFunction`) (see [Window Functions](#window-functions)) attached to it. The function
will contain the computation to 
    +be applied to the contents of the window, while the `Trigger` specifies the conditions
under which the window is 
    +considered ready for the function to be applied. A triggering policy might be something
like "when the number of elements 
    +in the window is more than 4", or "when the watermark passes the end of the window".
A trigger can also decide to 
    +purge a window's contents any time between its creation and removal. Purging in this
case only refers to the elements 
    +in the window, and *not* the window metadata. This means that new data can still be added
to that window.
     
    -input
    -    .keyBy(<key selector>)
    -    .window(<window assigner>)
    -    .<windowed transformation>(<window function>)
    -{% endhighlight %}
    -</div>
    -</div>
    +Apart from the above, you can specify an `Evictor` (see [Evictors](#evictors)) which
will be able to remove  
    +elements from the window after the trigger fires and before and/or after the function
is applied.
     
    -We will cover [window assigners](#window-assigners) in a separate section below.
    +In the following we go into more detail for each of the components above. We start with
the required parts in the above 
    +snippet (see [Keyed vs Non-Keyed Windows](#keyed-vs-non-keyed-windows), [Window Assigner](#window-assigner),
and 
    +[Window Function](#window-function)) before moving to the optional ones.
     
    -The window transformation can be one of `reduce()`, `fold()` or `apply()`. Which respectively
    -takes a `ReduceFunction`, `FoldFunction` or `WindowFunction`. We describe each of these
ways
    -of specifying a windowed transformation in detail below: [window functions](#window-functions).
    +## Keyed vs Non-Keyed Windows
     
    -For more advanced use cases you can also specify a `Trigger` that determines when exactly
a window
    -is being considered as *ready for processing*. These will be covered in more detail in
    -[triggers](#triggers).
    +The first thing to specify is whether your stream should be keyed or not. This has to
be done before defining the window. 
    +Using the `keyBy(...)` will split your infinite stream into logical keyed streams. If
`keyBy(...)` is not called, your 
    +stream is not keyed.
     
    -## Window Assigners
    +In the case of keyed streams, any attribute of your incoming events can be used as a
key 
    +(more details [here]({{ site.baseurl }}/dev/api_concepts.html#specifying-keys)). Having
a keyed stream will 
    +allow your windowed computation to be performed in parallel by multiple tasks, as each
logical keyed stream can be processed 
    +independently from the rest. All elements referring to the same key will be sent to the
same parallel task. 
     
    -The window assigner specifies how elements of the stream are divided into finite slices.
Flink comes
    -with pre-implemented window assigners for the most typical use cases, namely *tumbling
windows*,
    -*sliding windows*, *session windows* and *global windows*, but you can implement your
own by
    -extending the `WindowAssigner` class. All the built-in window assigners, except for the
global
    -windows one, assign elements to windows based on time, which can either be processing
time or event
    -time. Please take a look at our section on [event time]({{ site.baseurl }}/dev/event_time.html)
for more
    -information about how Flink deals with time.
    +In case of non-keyed streams, your original stream will not be split into multiple logical
streams and all the windowing logic 
    +will be performed by a single task, *i.e.* with parallelism of 1.
     
    -Let's first look at how each of these window assigners works before looking at how they
can be used
    -in a Flink program. We will be using abstract figures to visualize the workings of each
assigner:
    -in the following, the purple circles are elements of the stream, they are partitioned
    -by some key (in this case *user 1*, *user 2* and *user 3*) and the x-axis shows the progress
    -of time.
    +## Window Assigners
     
    -### Global Windows
    +After specifying whether your stream is keyed or not, the next step is to define a *windowing
strategy*. 
    --- End diff --
    
    I think we should stick to `window assigner` here because that's what we're describing.
In my mind, the ensemble of window assigner, trigger (and evictor) is actually the `windowing
strategy` since only those together define what happens in the end.
    
    What do you think? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

Mime
View raw message