flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Elias Levy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-6243) Continuous Joins: True Sliding Window Joins
Date Thu, 02 Nov 2017 22:50:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-6243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236751#comment-16236751

Elias Levy commented on FLINK-6243:

[~StephanEwen] anything to review?

> Continuous Joins:  True Sliding Window Joins
> --------------------------------------------
>                 Key: FLINK-6243
>                 URL: https://issues.apache.org/jira/browse/FLINK-6243
>             Project: Flink
>          Issue Type: New Feature
>          Components: DataStream API
>    Affects Versions: 1.1.4
>            Reporter: Elias Levy
>            Priority: Major
> Flink defines sliding window joins as the join of elements of two streams that share
a window of time, where the windows are defined by advancing them forward some amount of time
that is less than the window time span.  More generally, such windows are just overlapping
hopping windows. 
> Other systems, such as Kafka Streams, support a different notion of sliding window joins.
 In these systems, two elements of a stream are joined if the absolute time difference between
the them is less or equal the time window length.
> This alternate notion of sliding window joins has some advantages in some applications
over the current implementation.  
> Elements to be joined may both fall within multiple overlapping sliding windows, leading
them to be joined multiple times, when we only wish them to be joined once.
> The implementation need not instantiate window objects to keep track of stream elements,
which becomes problematic in the current implementation if the window size is very large and
the slide is very small.
> It allows for asymmetric time joins.  E.g. join if elements from stream A are no more
than X time behind and Y time head of an element from stream B.
> It is currently possible to implement a join with these semantics using {{CoProcessFunction}},
but the capability should be a first class feature, such as it is in Kafka Streams.
> To perform the join, elements of each stream must be buffered for at least the window
time length.  To allow for large window sizes and high volume of elements, the state, possibly
optionally, should be buffered such as it can spill to disk (e.g. by using RocksDB).
> The same stream may be joined multiple times in a complex topology.  As an optimization,
it may be wise to reuse any element buffer among colocated join operators.  Otherwise, there
may write amplification and increased state that must be snapshotted.

This message was sent by Atlassian JIRA

View raw message