flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1376) SubSlots are not properly released in case that a TaskManager fatally fails, leaving the system in a corrupted state
Date Tue, 03 Feb 2015 09:27:35 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302997#comment-14302997 ]

ASF GitHub Bot commented on FLINK-1376:
---------------------------------------

Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/317#issuecomment-72618925
  
    I think this is a good fix, overall. There is one issue I would really like to fix, and
that is the serializability of the `Instance` class. This class is not meant to be serialized
and moved around, as shown by the fact that it holds an `ActorRef` and by the need
to mark many of its fields transient.
    
    I assume that the instance needs to be serialized as part of the ExecutionGraph archiving,
where the ExecutionGraph is sent via an actor message to the archiver.
    
    I would like to solve that differently. The execution graph is "cleaned" before archiving
(see #344) to reduce its memory footprint. At that point, I would replace the `Instance` in the
Executions with the `InstanceConnectionInfo`, which holds all the necessary information. Then we
won't have to send instances through actor messages, which would be the cleaner solution.
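
    The replacement described above could look roughly like the following sketch. It is not
taken from the pull request: `ConnectionInfo`, `LiveInstance`, `ExecutionRecord`,
`prepareForArchiving()` and `ArchiveSketch` are hypothetical stand-ins for `InstanceConnectionInfo`,
`Instance`, an Execution, the "cleaning" step and a test helper. The only assumption is that the
live instance is dropped and its plain connection data kept before the graph is serialized.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for a plain, serializable connection descriptor. */
final class ConnectionInfo implements Serializable {
    private static final long serialVersionUID = 1L;
    final String hostname;
    final int dataPort;

    ConnectionInfo(String hostname, int dataPort) {
        this.hostname = hostname;
        this.dataPort = dataPort;
    }
}

/** Hypothetical stand-in for a live instance; deliberately NOT Serializable. */
final class LiveInstance {
    final ConnectionInfo connectionInfo;
    final Object actorHandle = new Object(); // placeholder for an actor reference

    LiveInstance(ConnectionInfo connectionInfo) {
        this.connectionInfo = connectionInfo;
    }
}

/** Hypothetical stand-in for one execution attempt inside the graph. */
final class ExecutionRecord implements Serializable {
    private static final long serialVersionUID = 1L;
    private LiveInstance instance;           // live handle, used while the job runs
    private ConnectionInfo archivedLocation; // set by prepareForArchiving()

    ExecutionRecord(LiveInstance instance) {
        this.instance = instance;
    }

    /** Keeps only the connection data and drops the live handle. */
    void prepareForArchiving() {
        if (instance != null) {
            archivedLocation = instance.connectionInfo;
            instance = null;
        }
    }

    /** Where the execution ran, before or after archiving. */
    ConnectionInfo location() {
        return instance != null ? instance.connectionInfo : archivedLocation;
    }
}

/** Helper that serializes a value with plain Java serialization. */
final class ArchiveSketch {
    static byte[] serialize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

    In this sketch, serializing an `ExecutionRecord` fails while it still references the live
instance and succeeds once `prepareForArchiving()` has run, so the instance class itself never
needs to be serializable.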


> SubSlots are not properly released in case that a TaskManager fatally fails, leaving
the system in a corrupted state
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-1376
>                 URL: https://issues.apache.org/jira/browse/FLINK-1376
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>
> If a TaskManager fails fatally and some of the failing node's slots are SharedSlots,
the JobManager does not release those slots properly. As a result, the corresponding
job is not properly failed, which leaves the system in a corrupted state.
> The reason is that the AllocatedSlot is not aware that it is being treated as a SharedSlot,
and thus it cannot release the associated SubSlots.
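
The failure mode described in the issue can be illustrated with a minimal sketch in which the
shared slot tracks its own sub-slots, so that releasing the parent also releases every child.
This is an illustration only, not the fix from the pull request: `SharedSlotSketch` and
`SubSlotSketch` are hypothetical names and do not mirror Flink's actual classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sub-slot that only remembers whether it was released. */
final class SubSlotSketch {
    private boolean released;

    void release() {
        released = true;
    }

    boolean isReleased() {
        return released;
    }
}

/** Hypothetical shared slot that knows about the sub-slots carved out of it. */
final class SharedSlotSketch {
    private final List<SubSlotSketch> subSlots = new ArrayList<>();
    private boolean released;

    /** Hands out a new sub-slot and remembers it for later release. */
    SubSlotSketch allocateSubSlot() {
        if (released) {
            throw new IllegalStateException("shared slot already released");
        }
        SubSlotSketch subSlot = new SubSlotSketch();
        subSlots.add(subSlot);
        return subSlot;
    }

    /** Releases this slot and, because it tracks them, all of its sub-slots. */
    void release() {
        if (released) {
            return; // releasing twice is a no-op
        }
        released = true;
        for (SubSlotSketch subSlot : subSlots) {
            subSlot.release();
        }
        subSlots.clear();
    }

    boolean isReleased() {
        return released;
    }
}
```

If the parent did not keep the `subSlots` list, as with a slot that is unaware of being shared,
`release()` would have nothing to iterate over and the sub-slots would stay allocated.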



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
