flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-9352) In Standalone checkpoint recover mode many jobs with same checkpoint interval cause IO pressure
Date Wed, 04 Jul 2018 09:36:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-9352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532533#comment-16532533 ]

ASF GitHub Bot commented on FLINK-9352:

Github user tillrohrmann commented on a diff in the pull request:

    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
    @@ -1173,9 +1179,10 @@ public void startCheckpointScheduler() {
     			periodicScheduling = true;
    +			long initialDelay = schedulerInitialDelayGenerator.nextLong(
    +				minPauseBetweenCheckpointsNanos / 1_000_000, baseInterval);
    --- End diff --
    Could we replace `schedulerInitialDelayGenerator` with
    `long initialDelay = ThreadLocalRandom.current().nextLong(minPauseBetweenCheckpointsNanos / 1_000_000, baseInterval);`?
    That way we would not have to use `RandomUtils`.
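
A minimal sketch of the suggestion above, using only the JDK. Note that `ThreadLocalRandom.nextLong(origin, bound)` throws `IllegalArgumentException` when `origin >= bound`, so the sketch guards that case. The class name `InitialDelaySketch`, the helper `randomInitialDelayMillis`, and the fallback to the minimum pause are illustrative assumptions, not code from the pull request.

```java
import java.util.concurrent.ThreadLocalRandom;

public class InitialDelaySketch {

    /**
     * Hypothetical helper: picks a random initial delay in milliseconds from
     * [minPauseBetweenCheckpointsNanos / 1_000_000, baseInterval).
     */
    public static long randomInitialDelayMillis(long minPauseBetweenCheckpointsNanos, long baseInterval) {
        long minPauseMillis = minPauseBetweenCheckpointsNanos / 1_000_000;
        if (minPauseMillis >= baseInterval) {
            // nextLong(origin, bound) requires origin < bound.
            // Assumed fallback for this sketch: use the minimum pause as the delay.
            return minPauseMillis;
        }
        return ThreadLocalRandom.current().nextLong(minPauseMillis, baseInterval);
    }

    public static void main(String[] args) {
        // 500 ms minimum pause, 10 s checkpoint interval (example values only).
        long delay = randomInitialDelayMillis(500_000_000L, 10_000L);
        System.out.println("initial delay (ms): " + delay);
    }
}
```

Because `ThreadLocalRandom.current()` is obtained at the call site, no generator field has to be kept on the coordinator.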

> In Standalone checkpoint recover mode many jobs with same checkpoint interval cause IO pressure
> -----------------------------------------------------------------------------------------------
>                 Key: FLINK-9352
>                 URL: https://issues.apache.org/jira/browse/FLINK-9352
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.5.0, 1.4.2, 1.6.0
>            Reporter: vinoyang
>            Assignee: vinoyang
>            Priority: Major
>              Labels: pull-request-available
> Currently, the CheckpointCoordinator's startCheckpointScheduler uses *baseInterval*
> as the initialDelay parameter. *baseInterval* is also the checkpoint interval.
> In standalone checkpoint recovery mode, many jobs are configured with the same checkpoint
> interval. When all jobs are recovered (after a cluster restart or a JobManager leadership
> change), their checkpoint schedules tend to align: every job's CheckpointCoordinator
> starts and triggers checkpoints at approximately the same time.
> This caused high IO load concentrated in the same time window in our production environment.
> I suggest exposing the initial delay parameter of scheduleAtFixedRate as an API config
> option so that users can spread out checkpoints in this scenario.
> cc [~StephanEwen] [~Zentol]
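
The effect described in the issue can be illustrated with a small, self-contained sketch. It relies only on the documented `scheduleAtFixedRate` behaviour that runs start at `initialDelay`, then `initialDelay + period`, and so on. The class `CheckpointScatterSketch`, its method names, and the job count, interval, and window values are hypothetical and chosen for illustration; nothing here is taken from Flink's code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class CheckpointScatterSketch {

    /** Start time of the k-th run under scheduleAtFixedRate: initialDelay + k * period. */
    public static long triggerTimeMillis(long initialDelay, long period, int k) {
        return initialDelay + (long) k * period;
    }

    /**
     * Largest number of jobs whose first checkpoint falls into the same time window.
     * With one shared period, the same grouping repeats in every later interval.
     */
    public static int maxJobsPerWindow(long[] initialDelays, long windowMillis) {
        Map<Long, Integer> counts = new HashMap<>();
        int max = 0;
        for (long delay : initialDelays) {
            int count = counts.merge(delay / windowMillis, 1, Integer::sum);
            max = Math.max(max, count);
        }
        return max;
    }

    public static void main(String[] args) {
        int jobs = 100;               // assumed number of recovered jobs
        long baseInterval = 60_000L;  // assumed shared checkpoint interval (ms)
        long window = 1_000L;         // assumed window for measuring IO overlap (ms)

        // Current behaviour: every job uses baseInterval as its initial delay.
        long[] fixed = new long[jobs];
        Arrays.fill(fixed, baseInterval);

        // Scattered behaviour: each job draws its initial delay from [0, baseInterval).
        long[] scattered = new long[jobs];
        Random random = new Random();
        for (int i = 0; i < jobs; i++) {
            scattered[i] = (long) (random.nextDouble() * baseInterval);
        }

        System.out.println("fixed delay, max jobs per window:     " + maxJobsPerWindow(fixed, window));
        System.out.println("scattered delay, max jobs per window: " + maxJobsPerWindow(scattered, window));
    }
}
```

With a fixed initial delay all jobs land in the same window, whereas random delays spread them across the interval; the exact spread varies from run to run.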

This message was sent by Atlassian JIRA
