uima-dev mailing list archives

From "Jim Challenger (JIRA)" <...@uima.apache.org>
Subject [jira] [Updated] (UIMA-2772) DUCC resource manager - Restart and fast-start
Date Mon, 05 May 2014 15:10:16 GMT

     [ https://issues.apache.org/jira/browse/UIMA-2772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Challenger updated UIMA-2772:
---------------------------------

    Affects Version/s: 1.0.0-Ducc
        Fix Version/s: 1.0.0-Ducc

> DUCC resource manager - Restart and fast-start
> ----------------------------------------------
>
>                 Key: UIMA-2772
>                 URL: https://issues.apache.org/jira/browse/UIMA-2772
>             Project: UIMA
>          Issue Type: Bug
>          Components: DUCC
>    Affects Versions: 1.0.0-Ducc
>            Reporter: Jim Challenger
>            Assignee: Jim Challenger
>             Fix For: 1.0.0-Ducc
>
>
> Currently RM waits a "reasonable time" (init-stability) on startup to allow nodes to check
in, before accepting scheduling requests.  It is not possible to know exactly how long to
wait, making init-stability a heuristic.  For normal startup this is not a problem.  If RM
is restarting 'hot', or if the orchestrator publishes non-preemptable jobs on restart, and
the necessary nodes have not arrived by the completion of the init-stability wait, this can
cause many problems: over-commitment, under-commitment, and in some cases inconsistent state
(and crashes).
> To remedy this, RM will include the full Node object in its publications to the OR
(orchestrator), which will echo them back for work that it believes to be active. On startup,
RM can use the echoed Node objects to fully reconstruct its state as of its last publication,
eliminating the problem. A side-effect of this is that RM need not wait for nodes to check in,
significantly decreasing its startup time.  If nodes added to the resource pool in this way
never check in, the normal "dead node" mechanism will kick in, maintaining consistency.
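
The restart scheme described above can be illustrated with a minimal sketch. This is not
DUCC code: DUCC itself is written in Java, and every name here (Node, ResourceManager,
publish, restore, node_checkin, purge_dead_nodes, dead_node_timeout) is hypothetical and
chosen only for illustration. The sketch assumes the OR echoes the publication back
unchanged; filtering for work the OR believes to be active is left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for the full Node object carried in a publication."""
    name: str
    memory_gb: int


@dataclass
class ResourceManager:
    """Toy resource manager; all fields and methods are illustrative assumptions."""
    dead_node_timeout: float = 60.0
    nodes: dict = field(default_factory=dict)           # node name -> Node
    allocations: dict = field(default_factory=dict)     # work id -> node name
    last_heartbeat: dict = field(default_factory=dict)  # node name -> last time seen

    def publish(self):
        # Each allocation is published together with the full Node object.
        return [(work_id, self.nodes[name]) for work_id, name in self.allocations.items()]

    def restore(self, echoed, now):
        # Rebuild the node pool and allocations from the echoed publication,
        # without waiting for any node to check in.
        for work_id, node in echoed:
            self.nodes[node.name] = node
            self.allocations[work_id] = node.name
            # Start the dead-node clock at restore time for restored nodes.
            self.last_heartbeat.setdefault(node.name, now)

    def node_checkin(self, node, now):
        self.nodes[node.name] = node
        self.last_heartbeat[node.name] = now

    def purge_dead_nodes(self, now):
        # Ordinary dead-node handling: drop nodes that have been silent too long,
        # along with any allocations that were placed on them.
        dead = [name for name, seen in self.last_heartbeat.items()
                if now - seen > self.dead_node_timeout]
        for name in dead:
            del self.nodes[name]
            del self.last_heartbeat[name]
        self.allocations = {w: n for w, n in self.allocations.items() if n not in dead}
        return dead
```

In this sketch a restarted instance calls restore() once with the echoed publication and can
schedule immediately. A restored node that never calls node_checkin() is later removed by
purge_dead_nodes(), which mirrors the consistency argument made in the description.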



--
This message was sent by Atlassian JIRA
(v6.2#6252)
