flink-issues mailing list archives

From "vinoyang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-5621) Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues
Date Thu, 01 Mar 2018 09:51:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381763#comment-16381763 ]

vinoyang commented on FLINK-5621:
---------------------------------

Hi [~till.rohrmann], what's your opinion about this idea? Since Flink 1.5, the snapshots
produced by the local recovery feature may also frequently cause insufficient disk space.
If we collect the TaskManagers' metrics and evaluate them against some rules, the
ResourceManager can mark the offending TaskManagers as 'dangerous', and the scheduler can
then avoid them.
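
To make the idea concrete, here is a minimal sketch of the rule-based marking in Python. It is
not Flink code: the names (TaskManagerMetrics, low_disk, too_many_failures, find_dangerous,
schedulable) and the two metrics are hypothetical, chosen only to illustrate how metrics,
rules, and scheduling could fit together.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class TaskManagerMetrics:
    """Hypothetical metrics snapshot reported by one TaskManager."""
    tm_id: str
    free_disk_bytes: int
    recent_task_failures: int


# A rule returns True when the metrics indicate an operational problem.
Rule = Callable[[TaskManagerMetrics], bool]


def low_disk(min_free_bytes: int) -> Rule:
    """Rule: free disk space is below the given threshold."""
    return lambda m: m.free_disk_bytes < min_free_bytes


def too_many_failures(max_failures: int) -> Rule:
    """Rule: more recent task failures than the allowed maximum."""
    return lambda m: m.recent_task_failures > max_failures


def find_dangerous(metrics: Iterable[TaskManagerMetrics],
                   rules: List[Rule]) -> Set[str]:
    """Return the ids of TaskManagers that violate at least one rule."""
    return {m.tm_id for m in metrics if any(rule(m) for rule in rules)}


def schedulable(metrics: List[TaskManagerMetrics],
                rules: List[Rule]) -> List[str]:
    """Return the ids the scheduler may still use, in reporting order."""
    dangerous = find_dangerous(metrics, rules)
    return [m.tm_id for m in metrics if m.tm_id not in dangerous]
```

In this sketch the marking is recomputed from the latest metrics on every call, so a
TaskManager that recovers (for example, after disk space is freed) becomes schedulable again
without extra bookkeeping.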

> Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues
> ----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-5621
>                 URL: https://issues.apache.org/jira/browse/FLINK-5621
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.4
>            Reporter: Jamie Grier
>            Priority: Critical
>
> There are cases where jobs can get into a state where no progress can be made if there
> is something pathologically wrong with one of the TaskManager nodes in the cluster.
> An example of this would be a TaskManager on a machine that runs out of disk space.
> Flink never considers the TM to be "bad" and will keep using it to attempt to run tasks,
> which will continue to fail.
> A suggestion for overcoming this would be to allow an option where a TM will commit suicide
> if that TM was the source of an exception that caused a job to fail/restart.
> I'm sure there are plenty of other approaches to solving this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
