flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-2790) Add high availability support for Yarn
Date Fri, 02 Oct 2015 12:39:26 GMT

    [ https://issues.apache.org/jira/browse/FLINK-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14941081#comment-14941081 ]

ASF GitHub Bot commented on FLINK-2790:

GitHub user tillrohrmann opened a pull request:


    [FLINK-2790] [yarn] [ha] Adds high availability support for Yarn

    Adds high availability support for Yarn by exploiting Yarn's ability to restart a failed
application master. The behaviour depends on the Hadoop version: each version range listed
below supports a superset of the preceding range's functionality.
    ### 2.3.0 <= version < 2.4.0
    * Sets the number of application attempts to the configuration value `yarn.application-attempts`.
This means that the application can be restarted `yarn.application-attempts` times before Yarn
fails the application. If the application master fails, all TaskManager containers are
killed as well.
    ### 2.4.0 <= version < 2.6.0
    * Additionally keeps containers across application attempts. This
avoids killing the TaskManager containers when the application master fails.
The advantages are a faster start-up and that the user does not have to
wait to obtain the container resources again.
    ### 2.6.0 <= version
    * Sets the attempt failures validity interval to the Akka timeout value. With this
interval set, an application is only killed once the maximum number of application attempts
has been reached within a single interval. This prevents a long-running job from depleting
its application attempts.
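
The version-dependent behaviour above can be sketched as a simple feature gate. This is a minimal illustration, not the code from this pull request: the class `YarnHaFeatures`, the `Feature` enum, and both helper methods are hypothetical names, and the version comparison assumes purely numeric dotted versions such as `2.4.1`.

```java
import java.util.EnumSet;
import java.util.Set;

public class YarnHaFeatures {

    /** Hypothetical flags, one per behaviour described in the version sections. */
    enum Feature { MAX_APP_ATTEMPTS, KEEP_CONTAINERS, ATTEMPT_FAILURES_VALIDITY_INTERVAL }

    /** Compares two numeric dotted versions component by component. */
    static int compareVersions(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            // Missing components count as 0, so "2.4" equals "2.4.0".
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** Returns the HA features available for the given Hadoop version. */
    static Set<Feature> featuresFor(String hadoopVersion) {
        Set<Feature> features = EnumSet.noneOf(Feature.class);
        if (compareVersions(hadoopVersion, "2.3.0") >= 0) {
            features.add(Feature.MAX_APP_ATTEMPTS);
        }
        if (compareVersions(hadoopVersion, "2.4.0") >= 0) {
            features.add(Feature.KEEP_CONTAINERS);
        }
        if (compareVersions(hadoopVersion, "2.6.0") >= 0) {
            features.add(Feature.ATTEMPT_FAILURES_VALIDITY_INTERVAL);
        }
        return features;
    }

    public static void main(String[] args) {
        for (String version : new String[] {"2.3.0", "2.4.1", "2.6.0"}) {
            System.out.println(version + " -> " + featuresFor(version));
        }
    }
}
```

Each later version range only adds a flag, which mirrors the "superset" wording of the description. In YARN's client API the three flags would presumably map to `setMaxAppAttempts`, `setKeepContainersAcrossApplicationAttempts`, and `setAttemptFailuresValidityInterval` on `ApplicationSubmissionContext`; how the pull request actually wires them up is not shown here.
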
    This PR also refactors the different Yarn components to allow testing actors to be started
within Yarn. Furthermore, the `JobManager` start-up logic is slightly extended to allow
code reuse in the `ApplicationMasterBase`.
    The HA functionality is tested via the `YARNHighAvailabilityITCase`, in which the application
master is killed multiple times. Each time, the test checks that the single TaskManager successfully
reconnects to the newly started `YarnJobManager`. With version `2.3.0`, the `TaskManager`
is restarted instead.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink yarnHA

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1213

commit 1a18172ae69eb576638704f8e143a921aa8630d5
Author: Till Rohrmann <trohrmann@apache.org>
Date:   2015-09-01T14:35:48Z

    [FLINK-2790] [yarn] [ha] Adds high availability support for Yarn

commit 5359676556d16610303d4f36fcbe5320ef4e6643
Author: Till Rohrmann <trohrmann@apache.org>
Date:   2015-09-23T15:42:57Z

    Refactors JobManager's start actors method to be reusable

commit d6a47cd8ad265c5122d1a67c09773cbc5a491261
Author: Till Rohrmann <trohrmann@apache.org>
Date:   2015-09-24T12:55:18Z

    Yarn refactoring to introduce yarn testing functionality

commit f9578f136dd41cd9829d712f7c62a59c9ea8e194
Author: Till Rohrmann <trohrmann@apache.org>
Date:   2015-09-28T16:21:30Z

    Added support for testing yarn cluster. Extracted JobManager's and TaskManager's testing
messages into stackable traits.

commit dbfa16438ad9d7d61e8d1a582c8cd1de9352078e
Author: Till Rohrmann <trohrmann@apache.org>
Date:   2015-09-29T15:05:01Z

    Implemented YarnHighAvailabilityITCase using Akka messages for synchronization.


> Add high availability support for Yarn
> --------------------------------------
>                 Key: FLINK-2790
>                 URL: https://issues.apache.org/jira/browse/FLINK-2790
>             Project: Flink
>          Issue Type: Sub-task
>          Components: JobManager, TaskManager
>            Reporter: Till Rohrmann
>             Fix For: 0.10
> Add master high availability support for Yarn. The idea is to let Yarn restart a failed
> application master in a new container. For that, we set the number of application retries
> to something greater than 1.
> From version 2.4.0 onwards, it is possible to reuse already started containers for the
> TaskManagers, thus avoiding unnecessary restart delays.
> From version 2.6.0 onwards, it is possible to specify an interval in which the number
> of application attempts has to be exceeded in order to fail the job. This will prevent
> long-running jobs from eventually depleting all available application attempts.
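
The interval-based counting described for version 2.6.0 onwards can be modelled as a sliding window over failure timestamps. The sketch below is a simplified model of that idea only, not YARN's implementation: `AttemptFailureWindow` and its method are hypothetical names, and times are plain millisecond values supplied by the caller.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class AttemptFailureWindow {

    private final int maxAttempts;
    private final long validityIntervalMs;
    // Timestamps of failures that still count, oldest first.
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    AttemptFailureWindow(int maxAttempts, long validityIntervalMs) {
        this.maxAttempts = maxAttempts;
        this.validityIntervalMs = validityIntervalMs;
    }

    /**
     * Records a failure at time nowMs and reports whether another attempt
     * is still allowed. Failures older than the interval are forgotten.
     */
    boolean recordFailureAndCheckRestart(long nowMs) {
        while (!failureTimes.isEmpty()
                && nowMs - failureTimes.peekFirst() > validityIntervalMs) {
            failureTimes.pollFirst();
        }
        failureTimes.addLast(nowMs);
        // The application fails once maxAttempts failures fall into one interval.
        return failureTimes.size() < maxAttempts;
    }

    public static void main(String[] args) {
        AttemptFailureWindow window = new AttemptFailureWindow(2, 1_000L);
        System.out.println(window.recordFailureAndCheckRestart(0L));
        System.out.println(window.recordFailureAndCheckRestart(5_000L));
        System.out.println(window.recordFailureAndCheckRestart(5_500L));
    }
}
```

Without the interval, every failure over the lifetime of the job counts towards the limit; with it, only failures that cluster within one interval can exhaust the attempts, which is what protects long-running jobs.
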

This message was sent by Atlassian JIRA
