beam-commits mailing list archives

From "Kenneth Knowles (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-1372) OutputTimeFn and Accumulating Mode is Confusing
Date Wed, 01 Feb 2017 22:25:52 GMT

    [ https://issues.apache.org/jira/browse/BEAM-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15849036#comment-15849036 ]

Kenneth Knowles commented on BEAM-1372:
---------------------------------------

Nice. This is interesting and a real model issue.

This isn't actually a problem with {{OutputTimeFn}} per se. In its absence, you'd just have
to choose a default. That default used to be the MIN timestamp, which would hit the same problem.

It makes sense that the {{OutputTimeFn}} (or equivalent) should consider _all_ data, not just
the most recently buffered elements, when firing panes in accumulating mode.
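
To make the point above concrete, here is a minimal sketch in plain Python. It is a toy simulation, not Beam's API: {{fire_panes}}, {{can_reassign}}, and the integer timestamps are hypothetical names and values chosen for illustration. It assumes an earliest-input-timestamp policy, and it assumes that an element can only be reassigned to its own timestamp when that timestamp is not earlier than the timestamp of the pane it arrives in.

```python
def fire_panes(firings, consider_all):
    """Simulate accumulating-mode panes with an earliest-timestamp policy.

    firings: one list of element timestamps per trigger firing.
    consider_all: if True, the pane timestamp is the minimum over every
        element seen so far; if False, only over the newly buffered ones.
    Returns a list of (pane_contents, pane_timestamp) tuples.
    """
    accumulated = []
    panes = []
    for new_elements in firings:
        accumulated.extend(new_elements)
        basis = accumulated if consider_all else new_elements
        # Accumulating mode: each pane carries everything seen so far.
        panes.append((list(accumulated), min(basis)))
    return panes


def can_reassign(pane):
    """True if no element's timestamp is earlier than the pane timestamp."""
    contents, pane_timestamp = pane
    return all(ts >= pane_timestamp for ts in contents)


firings = [[1, 5], [7, 9]]  # hypothetical element timestamps per firing

recent_only = fire_panes(firings, consider_all=False)
all_data = fire_panes(firings, consider_all=True)

print(recent_only[1], can_reassign(recent_only[1]))
print(all_data[1], can_reassign(all_data[1]))
```

In this sketch the second pane holds all four elements in both cases. When only the newly buffered elements are considered, that pane is stamped 7, so the elements with timestamps 1 and 5 can no longer be moved back to their own timestamps. When all data is considered, the pane is stamped 1 and every element can be.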

> OutputTimeFn and Accumulating Mode is Confusing
> -----------------------------------------------
>
>                 Key: BEAM-1372
>                 URL: https://issues.apache.org/jira/browse/BEAM-1372
>             Project: Beam
>          Issue Type: Bug
>          Components: beam-model
>            Reporter: Thomas Groh
>
> See [here| https://github.com/tgroh/beam/commit/2238df334a368ce1a41e14ee616be954c5430c73]
> for an example pipeline
> The timestamp used by a pane does not change based on the accumulation mode of the windowing
> strategy - as a result, elements which have associated timestamps cannot be safely reassigned
> to those timestamps after a GroupByKey if more than one pane could have been produced, regardless
> of the {{OutputTimeFn}}. The first example pipeline demonstrates two PCollections where the
> elements within the last PCollection cannot be reassigned to their timestamps, even though
> we are using {{OutputTimeFn#outputAtEarliestInputTimestamp}} and 
> When using a more complex windowing strategy like sessions, this is even more confusing
> - a session that spans more than one of the downstream windows but that is produced in multiple
> panes will over time be assigned to later and later windows as more panes are produced - thus,
> a pipeline that produces session windows and wishes to group the sessions by the point at
> which they started must only ever produce a single pane per session.
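
The session case described in the issue can be sketched the same way. This is a plain-Python toy model, not Beam code: {{fixed_window_start}}, {{downstream_window_starts}}, the window size of 10, and the timestamps are all hypothetical. It assumes each pane is stamped with the earliest timestamp of the elements it is computed from, and that the downstream step assigns a pane to a fixed window based on that timestamp.

```python
def fixed_window_start(timestamp, size):
    """Start of the fixed window of the given size containing timestamp."""
    return timestamp - (timestamp % size)


def downstream_window_starts(firings, size, consider_all):
    """Downstream fixed-window start for each pane of a single session.

    firings: one list of element timestamps per trigger firing.
    consider_all: if True, stamp each pane with the minimum over all
        elements seen so far; if False, only over the newly buffered ones.
    """
    seen = []
    starts = []
    for new_elements in firings:
        seen.extend(new_elements)
        basis = seen if consider_all else new_elements
        starts.append(fixed_window_start(min(basis), size))
    return starts


# One session whose elements arrive across three firings (hypothetical data).
session_firings = [[2, 4], [12], [25]]

drifting = downstream_window_starts(session_firings, size=10, consider_all=False)
stable = downstream_window_starts(session_firings, size=10, consider_all=True)

print(drifting)
print(stable)
```

Under these assumptions, stamping panes from only the newly buffered elements sends successive panes of the same session to windows starting at 0, 10, and 20, while stamping from all data keeps every pane in the window where the session started.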



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
