flink-issues mailing list archives

From "Stefan Richter (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (FLINK-9635) Local recovery scheduling can cause spread out of tasks
Date Tue, 30 Oct 2018 17:38:01 GMT

     [ https://issues.apache.org/jira/browse/FLINK-9635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefan Richter closed FLINK-9635.
    Resolution: Fixed

Merged in:
master: c8a6471a58

> Local recovery scheduling can cause spread out of tasks
> -------------------------------------------------------
>                 Key: FLINK-9635
>                 URL: https://issues.apache.org/jira/browse/FLINK-9635
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0, 1.6.2
>            Reporter: Till Rohrmann
>            Assignee: Stefan Richter
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.7.0
> In order to make local recovery work, Flink's scheduling was changed so that it tries to
reschedule each task to its previous location. In order not to occupy slots that have the state
of other tasks cached, the strategy requests a new slot if the old slot, identified by the
previous allocation id, is no longer present. This also applies to newly allocated slots,
because the strategy does not distinguish between new and already used slots. As a result,
every task can get deployed to its own slot if, for example, the {{SlotPool}} has released all
slots in the meantime. The consequence could be that a job can no longer be executed after
a failure because it needs more slots than before.
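
The effect described above can be illustrated with a small toy model. This is only a sketch under stated assumptions and not Flink's actual scheduler code: the class `LocalRecoverySpreadSketch` and its methods are hypothetical, tasks are represented solely by the allocation id of the slot they previously ran in, and tasks that shared an allocation id are assumed to have shared one slot.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the slot spread-out described in the issue; NOT Flink's scheduler code. */
final class LocalRecoverySpreadSketch {

    /** Slots in use before the failure: tasks with the same allocation id shared one slot. */
    static int slotsBeforeFailure(List<String> previousAllocationIds) {
        return new HashSet<>(previousAllocationIds).size();
    }

    /**
     * Slots needed after the failure under the reported behaviour: a task reuses
     * its previous slot if that allocation id is still present; otherwise it
     * requests a new slot, and no other task is placed into that new slot.
     */
    static int slotsAfterRecovery(List<String> previousAllocationIds, Set<String> stillPresent) {
        Set<String> reused = new HashSet<>();
        int newlyRequested = 0;
        for (String previous : previousAllocationIds) {
            if (stillPresent.contains(previous)) {
                reused.add(previous);   // previous location found, share it as before
            } else {
                newlyRequested++;       // previous slot gone, one fresh slot per task
            }
        }
        return reused.size() + newlyRequested;
    }

    public static void main(String[] args) {
        // Four tasks that previously shared two slots (hypothetical allocation ids).
        List<String> previous = List.of("alloc-1", "alloc-1", "alloc-2", "alloc-2");

        System.out.println(slotsBeforeFailure(previous));                               // 2
        System.out.println(slotsAfterRecovery(previous, Set.of("alloc-1", "alloc-2"))); // 2
        System.out.println(slotsAfterRecovery(previous, Set.of()));                     // 4
    }
}
```

In this model, a job that ran in two slots asks for four once all previous slots have been released, which matches the situation where the job can no longer be executed after a failure.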

This message was sent by Atlassian JIRA