flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Seth Wiesman (Jira)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-22587) Support aggregations in batch mode with DataStream API
Date Thu, 27 May 2021 15:49:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17352575#comment-17352575

Seth Wiesman commented on FLINK-22587:

[~echauchot] That makes sense. You can just assign a static timestamp to each record and it
should work. Watermarks are disregarded in batch mode anyway so the strategy doesn't matter.
The preferable other option is to just do a join manually with a KeyedCoProcessFunction, the
reason why is I think the window join operator in DataStream will eventually be deprecated
for all the reasons you've discovered. 


FWIW I would also like to deprecate custom triggers and window assigners, I work with a lot
of Flink users and have found usage of non-standard window constructs to be an antipattern.
The code ends up being convoluted and (Keyed)ProcessFunction always ends up as a cleaner,
more maintainable solution.

> Support aggregations in batch mode with DataStream API
> ------------------------------------------------------
>                 Key: FLINK-22587
>                 URL: https://issues.apache.org/jira/browse/FLINK-22587
>             Project: Flink
>          Issue Type: Bug
>          Components: API / DataStream
>    Affects Versions: 1.12.0, 1.13.0
>            Reporter: Etienne Chauchot
>            Priority: Major
> A pipeline like this *in batch mode* would output no data
> {code:java}
> stream.join(otherStream)
>     .where(<KeySelector>)
>     .equalTo(<KeySelector>)
>     .window(GlobalWindows.create())
>     .apply(<JoinFunction>)
> {code}
> Indeed the default trigger for GlobalWindow is NeverTrigger which never fires. If we
set a _EventTimeTrigger_ it will fire with every element as the watermark will be set to +INF
(batch mode) and will pass the end of the global window with each new element. A _ProcessingTimeTrigger_ never
fires either and all elapsed time or delta based triggers would not be suited for batch.
> Same goes for _reduce()_ instead of join().
> So I guess we miss something for batch support with DataStream.

This message was sent by Atlassian Jira

View raw message