hadoop-yarn-dev mailing list archives

From "Gergo Repas (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-7913) Improve error handling when application recovery fails with exception
Date Fri, 09 Feb 2018 11:00:00 GMT
Gergo Repas created YARN-7913:
---------------------------------

             Summary: Improve error handling when application recovery fails with exception
                 Key: YARN-7913
                 URL: https://issues.apache.org/jira/browse/YARN-7913
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: resourcemanager
    Affects Versions: 3.0.0
            Reporter: Gergo Repas
            Assignee: Gergo Repas


There are edge cases in which application recovery fails with an exception.

Example failure scenario:
 * Setup: a queue is a leaf queue in the primary RM's config, and the same queue is a parent
queue in the secondary RM's config.
 * When failover happens with this setup, recovery fails for applications on this
queue, and an APP_REJECTED event is dispatched to the async dispatcher. On the same thread
(the one that handles the recovery), a NullPointerException is thrown when recovery of the
applicationAttempt is attempted (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
I don't see a good way to avoid the NPE in this scenario, because when the NPE occurs the
APP_REJECTED event has not been processed yet, so we don't yet know that the application
recovery failed.
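
The ordering problem described above can be illustrated with a toy model. This is
only a sketch and not YARN code: every class and method name below is hypothetical, and it
assumes (the ticket text does not spell this out) that a rejected application is simply
missing from the scheduler's bookkeeping when the attempt is recovered, while the
APP_REJECTED event is still waiting in the dispatcher's queue.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model only: none of these classes are YARN classes.
class RecoveryOrderingSketch {
    // Stand-in for the scheduler's per-application bookkeeping.
    static final class SchedulerApp {
        final String user;
        SchedulerApp(String user) { this.user = user; }
    }

    final Map<String, SchedulerApp> schedulerApps = new HashMap<>();
    // Stand-in for the async dispatcher's queue; in the real system it is
    // drained later by a different thread.
    final Queue<String> pendingEvents = new ArrayDeque<>();

    // Recovering the application: a rejected app is never registered,
    // and the rejection is only queued, not handled.
    void recoverApplication(String appId, String user, boolean queueIsValid) {
        if (!queueIsValid) {
            pendingEvents.add("APP_REJECTED:" + appId);
            return;
        }
        schedulerApps.put(appId, new SchedulerApp(user));
    }

    // Recovering the attempt on the same thread: the lookup yields null
    // for the rejected app, and dereferencing it throws.
    String recoverAttempt(String appId) {
        SchedulerApp app = schedulerApps.get(appId);
        return app.user; // NullPointerException if the app was rejected
    }

    // Runs both steps back to back, as the recovery thread does in this model.
    static String describeOutcome(boolean queueIsValid) {
        RecoveryOrderingSketch rm = new RecoveryOrderingSketch();
        rm.recoverApplication("app_1", "alice", queueIsValid);
        try {
            return "recovered attempt for " + rm.recoverAttempt("app_1");
        } catch (NullPointerException e) {
            return "NPE with " + rm.pendingEvents.size() + " unprocessed event(s)";
        }
    }
}
```

In this model the rejection is already known to the queue but not to the recovery thread,
which is why a simple null check on that thread cannot tell "rejected" apart from any other
missing state.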

Currently the first exception aborts the recovery, so if there are X applications there
will be ~X passive -> active RM transition attempts: the passive -> active RM transition
only succeeds once the last APP_REJECTED event has been processed on the async dispatcher thread.
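
The "~X attempts" estimate can be reproduced with a small simulation. This is a sketch
under stated assumptions, not YARN code: the class and method names are hypothetical, each
attempt is assumed to abort on the first application whose APP_REJECTED event is still
unprocessed, and that event is assumed to be processed before the next attempt starts.

```java
// Toy model only: not YARN code. Counts how many passive -> active transition
// attempts are needed when the first exception aborts the whole recovery.
class TransitionAttemptSketch {
    static int attemptsUntilActive(int failingApps) {
        // true = this app's APP_REJECTED event has already been processed.
        boolean[] rejected = new boolean[failingApps];
        int attempts = 0;
        while (true) {
            attempts++;
            int firstUnprocessed = -1;
            for (int i = 0; i < failingApps; i++) {
                if (!rejected[i]) {
                    firstUnprocessed = i;
                    break;
                }
            }
            if (firstUnprocessed < 0) {
                return attempts; // nothing left to trip over: transition succeeds
            }
            // Assumption: the event queued during the aborted attempt is
            // processed before the next attempt starts.
            rejected[firstUnprocessed] = true;
        }
    }
}
```

Under these assumptions, X failing applications cost X aborted attempts plus one successful
one, which grows linearly with the number of affected applications.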

_The point of this ticket is to improve the error handling and reduce the number of passive
-> active RM transition attempts (solving the failure scenario described above isn't in
scope)._



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


