flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-9910) Non-queued scheduling failure sometimes does not return the slot
Date Sun, 22 Jul 2018 19:54:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-9910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552138#comment-16552138 ]

ASF GitHub Bot commented on FLINK-9910:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/6385

    [FLINK-9910][scheduling] Execution#scheduleForExecution does not cancel slot future

    ## What is the purpose of the change
    
    In order to properly give back an allocated slot to the SlotPool, one must not complete
    the result future of Execution#allocateAndAssignSlotForExecution. This commit changes the
    behaviour in Execution#scheduleForExecution accordingly.
    
    This PR is based on #6384.
    
    ## Verifying this change
    
    - Added `ExecutionTest#testEagerSchedulingFailureReturnsSlot`
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing,
        Yarn/Mesos, ZooKeeper: (yes)
      - The S3 file system connector: (no)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink improveExecutionVertexFailureHandling

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6385.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6385
    
----
commit f85ec37cc3ad21998eabad45a6dcb46e8efc62fb
Author: Till Rohrmann <trohrmann@...>
Date:   2018-07-19T11:07:44Z

    [FLINK-9838][logging] Don't log slot request failures on the ResourceManager

commit 7c703fb3b350ef5b02b01d621c3a16d4bca6f707
Author: Till Rohrmann <trohrmann@...>
Date:   2018-07-19T11:41:03Z

    [hotfix] Improve logging of SlotPool and SlotSharingManager

commit 414a8d231a5b6cdc2d5db0c1d35a79ff584c1cd0
Author: Till Rohrmann <trohrmann@...>
Date:   2018-07-22T18:05:05Z

    [FLINK-9908][scheduling] Do not cancel individual scheduling future
    
    Since the individual scheduling futures contain logic to release the slot if it cannot
    be assigned to the Execution, we must not cancel them. Otherwise we might risk that
    slots are not returned to the SlotPool leaving it in an inconsistent state.

commit 8f4471339db3a2df01c1cc61e03eb0881f98dd4f
Author: Till Rohrmann <trohrmann@...>
Date:   2018-07-22T18:17:11Z

    [FLINK-9909][core] ConjunctFuture does not cancel input futures
    
    If a ConjunctFuture is cancelled, it no longer cancels its input futures
    automatically. If the user needs this behaviour, they have to implement
    it explicitly. The reason for this change is that an implicit
    cancellation can have unwanted side effects, because the producers of
    the cancelled input futures won't be executed.
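
The behaviour described in this commit message can be illustrated with the JDK alone. A minimal sketch, assuming `CompletableFuture.allOf` as a stand-in for Flink's ConjunctFuture (it is not the Flink class); the class and method names are hypothetical. Cancelling the combined future leaves the inputs untouched unless the caller propagates the cancellation explicitly:

```java
import java.util.concurrent.CompletableFuture;

public class ConjunctCancelSketch {

    /** Cancels only the combined future; returns {all, a, b} cancellation flags. */
    public static boolean[] cancelConjunction() {
        CompletableFuture<Integer> a = new CompletableFuture<>();
        CompletableFuture<Integer> b = new CompletableFuture<>();
        // Stand-in for a conjunct future (assumption: not Flink's ConjunctFuture).
        CompletableFuture<Void> all = CompletableFuture.allOf(a, b);

        all.cancel(false);
        return new boolean[] {all.isCancelled(), a.isCancelled(), b.isCancelled()};
    }

    /** Same setup, but the caller opts in to cancelling the inputs. */
    public static boolean[] cancelWithExplicitPropagation() {
        CompletableFuture<Integer> a = new CompletableFuture<>();
        CompletableFuture<Integer> b = new CompletableFuture<>();
        CompletableFuture<Void> all = CompletableFuture.allOf(a, b);

        // Explicit propagation: forward a cancellation of the result to the inputs.
        all.whenComplete((ignored, error) -> {
            if (all.isCancelled()) {
                a.cancel(false);
                b.cancel(false);
            }
        });

        all.cancel(false);
        return new boolean[] {all.isCancelled(), a.isCancelled(), b.isCancelled()};
    }

    public static void main(String[] args) {
        boolean[] implicit = cancelConjunction();
        boolean[] explicit = cancelWithExplicitPropagation();
        System.out.println("inputs cancelled without propagation: " + (implicit[1] || implicit[2]));
        System.out.println("inputs cancelled with propagation: " + (explicit[1] && explicit[2]));
    }
}
```

Keeping propagation opt-in means a caller only cancels the inputs when it knows that skipping their producers is safe.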

commit c606145182c0531a8239decdc52ceeccdb81ca73
Author: Till Rohrmann <trohrmann@...>
Date:   2018-07-22T18:20:53Z

    [hotfix] Fix checkstyle violations in FutureUtils

commit c296d8b146cd08367329226b9ecaa28bd86ba1ed
Author: Till Rohrmann <trohrmann@...>
Date:   2018-07-22T18:34:33Z

    [hotfix] Replace check state condition in Execution#tryAssignResource with if check
    
    Instead of risking an IllegalStateException, it is better to check whether the
    taskManagerLocationFuture has already been completed. If it has, we reject
    the assignment of the LogicalSlot to the Execution. That way, an exception in
    Execution#allocateAndAssignSlotForExecution can no longer prevent the slot
    from being released.
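
The difference between a state check that throws and a plain if check can be sketched as follows. This is an illustration under assumptions, not the Flink implementation: `AssignSketch`, `tryAssign`, `assignOrRelease`, and the release counter are hypothetical names, and a string stands in for the slot.

```java
import java.util.concurrent.CompletableFuture;

public class AssignSketch {

    // Stand-in for the location future mentioned in the commit message.
    private final CompletableFuture<String> locationFuture = new CompletableFuture<>();
    private int releasedSlots = 0;

    /** Rejects the assignment with a return value instead of throwing. */
    public boolean tryAssign(String slot) {
        if (locationFuture.isDone()) {
            // Already assigned: report rejection so the caller can release the slot.
            return false;
        }
        return locationFuture.complete(slot);
    }

    /** Caller side: a rejected slot is always released (counted here). */
    public void assignOrRelease(String slot) {
        if (!tryAssign(slot)) {
            releasedSlots++;
        }
    }

    public int releasedSlots() {
        return releasedSlots;
    }

    public static void main(String[] args) {
        AssignSketch execution = new AssignSketch();
        execution.assignOrRelease("slot-1");
        execution.assignOrRelease("slot-2");
        System.out.println("released slots: " + execution.releasedSlots());
    }
}
```

Because the rejection is an ordinary return value, the release path in the caller is reached without depending on exception handling.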

commit 69b8c7c7b5905be83c7c393423c064de9b78375f
Author: Till Rohrmann <trohrmann@...>
Date:   2018-07-22T18:43:44Z

    [hotfix] Fix checkstyle violations in ExecutionVertex

commit 6e018cfdf84192041a4b1ba27dcbdbf645e8d40b
Author: Till Rohrmann <trohrmann@...>
Date:   2018-07-22T18:46:37Z

    [hotfix] Fix checkstyle violations in ExecutionJobVertex

commit f8805be13d2c0c2da58e0e7ecc6dc102953fc0c5
Author: Till Rohrmann <trohrmann@...>
Date:   2018-07-22T18:48:53Z

    [hotfix] Fix checkstyle violations in Execution

commit 0e9fbf8157e45d260a1a418c25871031a98a4995
Author: Till Rohrmann <trohrmann@...>
Date:   2018-07-22T19:38:42Z

    [FLINK-9910][scheduling] Execution#scheduleForExecution does not cancel slot future
    
    In order to properly give back an allocated slot to the SlotPool, one must not complete
    the result future of Execution#allocateAndAssignSlotForExecution. This commit changes the
    behaviour in Execution#scheduleForExecution accordingly.

----


> Non-queued scheduling failure sometimes does not return the slot
> ----------------------------------------------------------------
>
>                 Key: FLINK-9910
>                 URL: https://issues.apache.org/jira/browse/FLINK-9910
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.1, 1.6.0, 1.7.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.5.2, 1.6.0
>
>
> Similar to FLINK-9908, it can happen that in case of a non-queued scheduling failure
> a slot is not properly returned to the {{SlotPool}}. The reason for the failure seems to be
> the exceptional completion of the {{allocationFuture}} in {{Execution#scheduleForExecution}}.
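
The failure mode described in the issue can be reproduced with a plain `java.util.concurrent.CompletableFuture`, independent of Flink. A minimal sketch, assuming the slot is released inside a callback chained onto the slot future; `SlotReturnSketch`, the `released` flag, and the string slot are hypothetical stand-ins rather than Flink code. If the caller completes the dependent result future itself, the chained callback is skipped when the slot finally arrives, so the release logic never runs:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

public class SlotReturnSketch {

    /** Returns true if the release callback ran once the slot arrived. */
    public static boolean releaseRuns(boolean completeResultEarly) {
        AtomicBoolean released = new AtomicBoolean(false);
        CompletableFuture<String> slotFuture = new CompletableFuture<>();

        // Dependent stage: the assignment fails here, so the slot is given back.
        CompletableFuture<String> resultFuture = slotFuture.thenApply(slot -> {
            released.set(true); // stands in for returning the slot to the pool
            throw new IllegalStateException("could not assign " + slot);
        });

        if (completeResultEarly) {
            // Non-queued scheduling: the caller fails the result future directly.
            resultFuture.completeExceptionally(
                new IllegalStateException("slot allocation not completed yet"));
        }

        // The slot arrives later; the callback only runs if the result is still open.
        slotFuture.complete("slot-1");
        return released.get();
    }

    public static void main(String[] args) {
        System.out.println("result completed early, released: " + releaseRuns(true));
        System.out.println("result left untouched, released: " + releaseRuns(false));
    }
}
```

In this sketch the slot is only given back when the result future is left to complete through its own chain, which matches the reasoning in the pull request description.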



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
