beam-commits mailing list archives

From "Sam McVeety (JIRA)" <>
Subject [jira] [Commented] (BEAM-758) Per-step, per-execution nonce
Date Sat, 04 Feb 2017 21:19:51 GMT


Sam McVeety commented on BEAM-758:

After further conversations, it seems that ProcessContext is a more desirable place to put
this.  As an amended proposal, what do folks think about providing the following in ProcessContext:

/** Provides a nonce that is unique and stable for this job execution instance. **/
String getJobNonce();

/** Provides a nonce that is unique and stable for this step. **/
String getStepId();

Together, these two methods allow for stable values shared across multiple steps where needed,
as well as step-unique values.
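
To make the intended use concrete, here is a minimal sketch of how the two accessors might be consumed. Everything except the two method names is an assumption for illustration: NonceContext is a hypothetical stand-in for the relevant slice of ProcessContext, FixedNonceContext fakes what a runner would supply, and the "temp_" naming scheme is invented.

```java
import java.util.UUID;

public class NonceContextSketch {

    /** Hypothetical stand-in for the part of ProcessContext that would carry the proposal. */
    public interface NonceContext {
        /** A nonce that is unique and stable for this job execution instance. */
        String getJobNonce();

        /** An identifier that is unique and stable for this step. */
        String getStepId();
    }

    /** Fake context with fixed values; in practice the runner would supply these. */
    public static final class FixedNonceContext implements NonceContext {
        private final String jobNonce;
        private final String stepId;

        public FixedNonceContext(String jobNonce, String stepId) {
            this.jobNonce = jobNonce;
            this.stepId = stepId;
        }

        @Override
        public String getJobNonce() {
            return jobNonce;
        }

        @Override
        public String getStepId() {
            return stepId;
        }
    }

    /** Name shared by every step of one execution (invented naming scheme). */
    public static String sharedName(NonceContext c) {
        return "temp_" + c.getJobNonce();
    }

    /** Name unique to one step within one execution (invented naming scheme). */
    public static String stepName(NonceContext c) {
        return "temp_" + c.getJobNonce() + "_" + c.getStepId();
    }

    public static void main(String[] args) {
        // One random value per execution, shared by all steps of that execution.
        String jobNonce = UUID.randomUUID().toString().replace("-", "");
        NonceContext read = new FixedNonceContext(jobNonce, "read");
        NonceContext write = new FixedNonceContext(jobNonce, "write");

        System.out.println(sharedName(read).equals(sharedName(write)));
        System.out.println(stepName(read).equals(stepName(write)));
    }
}
```

The job nonce alone gives a value that two steps can agree on; combining it with the step id gives a value that no other step in the same execution shares.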

> Per-step, per-execution nonce
> -----------------------------
>                 Key: BEAM-758
>                 URL:
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-core
>    Affects Versions: Not applicable
>            Reporter: Daniel Halperin
>            Assignee: Sam McVeety
> In the forthcoming runner API, a user will be able to save a pipeline to JSON and then
> run it repeatedly.
> Many pieces of code (e.g., BigQueryIO.Read or Write) rely on a single random value (nonce).
> These values are typically generated at apply time, so that they are deterministic (don't
> change across retries of DoFns) and global (are the same across all workers).
> However, once the runner API lands the existing code would result in the same nonce being
> reused across jobs. Other possible solutions:
> * Generate nonce in {{Create(1) | ParDo}} then use this as a side input. Should work,
> as long as side inputs are actually checkpointed. But does not work for {{BoundedSource}}.
> * If a nonce is only needed for the lifetime of one bundle, can be generated in {{startBundle}}
> and used in {{finishBundle}} [or {{tearDown}}].
> * Add some context somewhere that lets user code access unique step name, and somehow
> generate a nonce consistently e.g. by hashing. Will usually work, but this is similarly not
> available to sources.
> Another Q: I'm not sure we have a good way to generate nonces in unbounded pipelines
> -- we probably need one. This would enable us to, e.g., use {{BigQueryIO.Write}} in an unbounded
> pipeline [if we had, e.g., exactly-once triggering per window]. Or generalizing to multiple
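
The hashing idea in the third quoted bullet could look roughly like the sketch below. It is not taken from the issue: the choice of SHA-256, the hex encoding, and the executionId parameter are all assumptions. In particular, the issue does not say where a per-execution value would come from, but some such input seems necessary, since hashing the step name alone would give the same nonce on every run of a saved pipeline.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashedNonceSketch {

    /**
     * Derives a nonce from a per-execution value and a unique step name.
     * Both parameters are hypothetical inputs that a runner would have to expose.
     */
    public static String deriveNonce(String executionId, String stepName) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(executionId.getBytes(StandardCharsets.UTF_8));
            // Separator byte so that ("ab", "c") and ("a", "bc") hash differently.
            digest.update((byte) 0);
            digest.update(stepName.getBytes(StandardCharsets.UTF_8));

            // Hex-encode the 32-byte digest into a 64-character string.
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    public static void main(String[] args) {
        // The same inputs always produce the same nonce, on any worker and on any retry.
        System.out.println(deriveNonce("execution-1", "WriteToBigQuery"));
        // A different execution of the same step produces a different nonce.
        System.out.println(deriveNonce("execution-2", "WriteToBigQuery"));
    }
}
```

Because the result depends only on its inputs, every worker and every retry computes the same value without coordination. The limitation noted in the bullet still applies: code that cannot see the step name cannot call this.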
