beam-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Amit Sela (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (BEAM-696) Side-Inputs non-deterministic with merging main-input windows
Date Wed, 19 Oct 2016 17:58:58 GMT

    [ https://issues.apache.org/jira/browse/BEAM-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15589398#comment-15589398
] 

Amit Sela edited comment on BEAM-696 at 10/19/16 5:58 PM:
----------------------------------------------------------

I'm not sure about the Direct/Dataflow runners, but as for Flink, I remember Aljoscha said
the runner chooses not to "pre-combine" for ALL merging windows - even if no sideInputs are
defined.
So it's not accurate to say that there's no real sacrifice of optimization - for all merging
windows with no sideInputs (or sideInputs that are "agnostic" to merging) there is sacrifice
for nothing, or better, for the case where sideInputs are used and they should differ until
trigger. 
For example: all pipelines that use Sessions without SideInputs will suffer degradation in
performance just to keep the model satisfied while it should be the other way around.
As for the Spark runner, I will gladly ignore the optimization for merging windows - it's
easier for me, the implementation based on {{GroupByKey}} followed by {{GroupAlsoByWindow}}
and {{Combine.GroupedValues}} is very straight-forward and fully implemented in the runner
for sometime now.


was (Author: amitsela):
I'm not sure about the Direct/Dataflow runners, but as for Flink, Aljoscha clearly said the
runner chooses not to "pre-combine" for ALL merging windows - even if no sideInputs are defined.
So it's not accurate to say that there's no real sacrifice of optimization - for all merging
windows with no sideInputs (or sideInputs that are "agnostic" to merging) there is sacrifice
for nothing, or better, for the case where sideInputs are used and they should differ until
trigger. 
For example: all pipelines that use Sessions without SideInputs will suffer degradation in
performance just to keep the model satisfied while it should be the other way around.
As for the Spark runner, I will gladly ignore the optimization for merging windows - it's
easier for me, the implementation based on {{GroupByKey}} followed by {{GroupAlsoByWindow}}
and {{Combine.GroupedValues}} is very straight-forward and fully implemented in the runner
for sometime now.

> Side-Inputs non-deterministic with merging main-input windows
> -------------------------------------------------------------
>
>                 Key: BEAM-696
>                 URL: https://issues.apache.org/jira/browse/BEAM-696
>             Project: Beam
>          Issue Type: Bug
>          Components: beam-model
>            Reporter: Ben Chambers
>            Assignee: Pei He
>
> Side-Inputs are non-deterministic for several reasons:
> 1. Because they depend on triggering of the side-input (this is acceptable because triggers
are by their nature non-deterministic).
> 2. They depend on the current state of the main-input window in order to lookup the side-input.
This means that with merging
> 3. Any runner optimizations that affect when the side-input is looked up may cause problems
with either or both of these.
> This issue focuses on #2 -- the non-determinism of side-inputs that execute within a
Merging WindowFn.
> Possible solution would be to defer running anything that looks up the side-input until
we need to extract an output, and using the main-window at that point. Specifically, if the
main-window is a MergingWindowFn, don't execute any kind of pre-combine, instead buffer all
the inputs and combine later.
> This could still run into some non-determinism if there are triggers controlling when
we extract output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message